Disclaimer: The purpose of the Open Case Studies project is to demonstrate the use of various data science methods, tools, and software in the context of messy, real-world data. A given case study does not cover all aspects of the research process, is not claiming to be the most appropriate way to analyze a given data set, and should not be used in the context of making policy decisions without external consultation from scientific experts.

Motivation

A variety of different sources contribute different types of pollutants to what we call air pollution. Some sources are natural while others are anthropogenic (human derived).

Major types of air pollutants

  1. Gaseous - Carbon Monoxide (CO), Ozone (O3), Nitrogen Oxides (NO, NO2), Sulfur Dioxide (SO2)
  2. Particulate - small liquids and solids suspended in the air (includes lead and can include certain types of dust)
  3. Dust - small solids (larger than particulates) that can be suspended in the air for some time but eventually settle
  4. Biological - pollen, bacteria, viruses, mold spores

See [here](http://www.redlogenv.com/worker-safety/part-1-dust-and-particulate-matter) for more detail on the types of pollutants in the air.

Particulate pollution

Air pollution particulates are generally described by their size.

There are 3 major categories:

  1. Large Coarse Particulate Matter - has a diameter of > 10 micrometers (µm)

  2. Coarse Particulate Matter (called PM10-2.5) - has a diameter between 2.5 µm and 10 µm

  3. Fine Particulate Matter (called PM2.5) - has a diameter of < 2.5 µm

PM10 includes any particulate matter < 10 µm (both coarse and fine particulate matter).

Here you can see how these sizes compare with a human hair:

The following plot and table show the relative sizes of these different pollutants in micrometers (µm):

This table shows how deeply some of the smaller fine particles can penetrate within the human body:

Negative Impact of Particulate Exposure on Health

Exposure to air pollution is associated with higher rates of mortality in older adults and is known to be a risk factor for many diseases and conditions including but not limited to:

  1. Asthma - fine particle exposure (PM2.5) was found to be associated with higher rates of asthma in children
  2. Inflammation in type 1 diabetes - fine particle exposure (PM2.5) from traffic-related air pollution was associated with increased measures of inflammatory markers in youths with type 1 diabetes
  3. Lung function and emphysema - higher concentrations of ozone (O3), nitrogen oxides (NOx), black carbon, and fine particle exposure (PM2.5) at study baseline were significantly associated with greater increases in percent emphysema per 10 years
  4. Low birthweight - fine particle exposure (PM2.5) was associated with lower birth weight in full-term live births
  5. Viral Infection - higher rates of infection and increased severity of infection are associated with higher exposures to pollution levels including fine particle exposure (PM2.5)

See this review article for more information about sources of air pollution and the influence of air pollution on health.

Sparse Monitoring is Problematic for Public Health

Historically, epidemiological studies have assessed the influence of air pollution on health outcomes by relying on a number of monitors located around the country. However, as can be seen in the following figure, these monitors remain relatively sparse in certain regions of the country. Furthermore, dramatic differences in pollution rates can be seen even within the same city.

This lack of granularity in air pollution monitoring has hindered our ability to discern the full impact of air pollution on health and to identify at-risk locations.

Machine Learning Offers a Solution

An article published in the journal Environmental Health addressed this issue by using data about population density and road density, among other features, to predict air pollution levels at a more localized scale using machine learning methods.

Yanosky, J. D. et al. Spatio-temporal modeling of particulate air pollution in the conterminous United States using geographic and meteorological predictors. Environ Health 13, 63 (2014).

The authors of this article state that:

“Exposure to atmospheric particulate matter (PM) remains an important public health concern, although it remains difficult to quantify accurately across large geographic areas with sufficiently high spatial resolution. Recent epidemiologic analyses have demonstrated the importance of spatially- and temporally-resolved exposure estimates, which show larger PM-mediated health effects as compared to nearest monitor or county-specific ambient concentrations.”

The article above explains that machine learning methods can be used to predict air pollution levels when traditional monitoring systems are not available in a particular area or when there is not enough spatial granularity with current monitoring systems. We will use similar methods to predict annual air pollution levels spatially within the US.

Main Questions

Our main question:

  1. Can we predict annual average air pollution concentrations at the granularity of zip code regional levels using predictors such as population density, urbanization, and road density, as well as satellite pollution data and chemical modeling data?

Learning Objectives

In this case study, we will walk you through importing data from CSV files and performing machine learning methods to predict our outcome variable of interest (in this case annual fine particle air pollution estimates). We will especially focus on using packages and functions from the Tidyverse, and more specifically the tidymodels package/ecosystem primarily developed and maintained by Max Kuhn and Davis Vaughan. This package loads more modeling related packages like rsample, recipes, parsnip, yardstick, and dials. We will also briefly cover the workflows and tune packages. The tidyverse is a library of packages created by RStudio. While some students may be familiar with previous R programming packages, these packages make data science in R especially efficient.

We will begin by loading the packages that we will need:

Package Use
here to easily load and save data
readr to import the CSV file data
dplyr to view/arrange/filter/select/compare specific subsets of the data
skimr to get an overview of data
summarytools to get an overview of data in a different style
magrittr to use the %<>% piping operator
corrplot to make large correlation plots
ggcorrplot also to make large correlation plots
GGally to make smaller correlation plots
rsample to split the data into testing and training sets and to split the training set for cross-validation
recipes to pre-process data for modeling in a tidy and reproducible way and to extract pre-processed data (major functions are recipe() , prep() and various transformation step_*() functions, as well as juice() - extracts final preprocessed training data and bake() - applies recipe steps to testing data). See here for more info.
parsnip an interface to create models (major functions are fit(), set_engine())
yardstick to evaluate the performance of models
broom to get tidy output for our model fit and performance
ggplot2 to make visualizations with multiple layers
dials to specify hyper-parameter tuning
tune to perform cross validation, tune hyper-parameters, and get performance metrics
workflows to create modeling workflow to streamline the modeling process
vip to create variable importance plots
randomForest to perform the random forest analysis
stringr to manipulate the text in the map data
tidyr to separate data within a column into multiple columns
rnaturalearth to get the geometry data for the earth to plot the US
maps to get map database data about counties to draw them on our US map
sf to convert the map data into a data frame
lwgeom to use the sf function to convert the map geographical data
rgeos to use geometry data
cowplot to allow plots to be combined

The first time we use a function, we will use the :: to indicate which package we are using. Unless we have overlapping function names, this is not necessary, but we will include it here to be informative about where the functions we will use come from.
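As a minimal illustration of the :: notation, the sketch below calls the same function with and without the package prefix. It uses median() from the stats package, which ships with R, purely as a stand-in for the tidyverse functions used later; the numbers are made up.

```r
# Made-up numbers, used only to show the syntax.
x <- c(2, 4, 9)

# Explicit form: package name, then ::, then the function name.
explicit <- stats::median(x)

# Implicit form: works because stats is attached by default.
implicit <- median(x)

# Both forms call the same function, so the results match.
identical(explicit, implicit)
```

The explicit form also works when the package is installed but has not been attached with library().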

Context

The State of Global Air is a report released every year to communicate the impact of air pollution on public health.

The State of Global Air 2019 report which uses data from 2017 stated that:

Air pollution is the fifth leading risk factor for mortality worldwide. It is responsible for more deaths than many better-known risk factors such as malnutrition, alcohol use, and physical inactivity. Each year, more people die from air pollution–related disease than from road traffic injuries or malaria.

The report also stated that:

In 2017, air pollution is estimated to have contributed to close to 5 million deaths globally — nearly 1 in every 10 deaths.

The State of Global Air 2018 report, which used data from 2016 and separated different types of air pollution, found that particulate pollution was particularly associated with mortality.

The 2019 report shows that the highest levels of fine particulate pollution occur in Africa and Asia and that:

More than 90% of people worldwide live in areas exceeding the World Health Organization (WHO) Guideline for healthy air. More than half live in areas that do not even meet WHO’s least-stringent air quality target.

Looking at the US specifically, air pollution levels are generally improving. The US Environmental Protection Agency (EPA) also releases a report about air pollution levels called Our Nation’s Air.

However, air pollution continues to contribute to health risk for Americans, in particular in regions where pollution rates are higher than the national average and at times exceed the World Health Organization’s recommended level. Thus it is necessary to obtain high spatial granularity in estimates of air pollution in order to identify locations where populations are experiencing harmful levels of exposure.

You can see current air quality conditions at this website, and you will notice variation across different cities.

Here were the conditions in Topeka, Kansas when this was written:

It reports particulate values using what is called the Air Quality Index (AQI) scale. This calculator indicates that 114 AQI is equivalent to 40.7 ug/m3 and is considered unhealthy for sensitive individuals. Thus some areas greatly exceed the World Health Organization (WHO) annual exposure guideline (10 ug/m3) at certain times, and this may adversely affect the health of people living in these locations.
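The conversion from AQI to concentration can be approximated by linear interpolation between AQI breakpoints. The sketch below assumes the PM2.5 breakpoint table that the EPA used before its 2024 revision; the function name aqi_to_pm25() is hypothetical, and truncating to one decimal place is an assumption chosen to match the reported value. The online calculator may differ in its details.

```r
# Assumed PM2.5 breakpoints (ug/m3) and matching AQI ranges
# (EPA table in effect before the 2024 revision).
breakpoints <- data.frame(
  aqi_lo  = c(0,    51,   101,  151,   201,   301,   401),
  aqi_hi  = c(50,   100,  150,  200,   300,   400,   500),
  conc_lo = c(0.0,  12.1, 35.5, 55.5,  150.5, 250.5, 350.5),
  conc_hi = c(12.0, 35.4, 55.4, 150.4, 250.4, 350.4, 500.4)
)

# Hypothetical helper: convert one AQI value to a PM2.5 concentration.
aqi_to_pm25 <- function(aqi) {
  # Find the single breakpoint row whose AQI range contains the value.
  row <- breakpoints[aqi >= breakpoints$aqi_lo & aqi <= breakpoints$aqi_hi, ]
  stopifnot(nrow(row) == 1)

  # Interpolate linearly within that row.
  conc <- (aqi - row$aqi_lo) / (row$aqi_hi - row$aqi_lo) *
    (row$conc_hi - row$conc_lo) + row$conc_lo

  # Truncate (not round) to one decimal place; this is an assumption.
  floor(conc * 10) / 10
}

aqi_to_pm25(114)  # 40.7 under these assumptions
```

An AQI of 114 falls in the 101 to 150 band, which is the band labeled unhealthy for sensitive groups.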

Furthermore, adverse health effects have been associated with higher pollution exposure even when the levels are below suggested guidelines. In addition, it appears that the composition of the particulate matter and the influence of other demographic factors may make specific populations more at risk for adverse health effects due to air pollution. See this article for more details.

The monitor data that we will use in this case study comes from a system of monitors in which roughly 90% are located within cities. Thus there is an equity issue in terms of capturing the air pollution levels of more rural areas. Therefore, to get a better sense of the pollution exposures for the individuals living in these areas, methods like machine learning can be very useful to estimate air pollution levels in areas with little to no monitoring.

Indeed, machine learning methods are used to estimate air pollution in these low-monitoring areas so that we can make a map like this, with annual estimates for all of the contiguous US:

This is what we aim to achieve in this case study.

Limitations

There are some important considerations regarding this data analysis to keep in mind:

  1. The data in our analysis do not include information about the composition of particulate matter. Different types of particulates may be more benign or deleterious for health outcomes.

  2. Outdoor pollution levels are not necessarily an indication of individual exposures. People spend differing amounts of time indoors and outdoors and are exposed to different pollution levels indoors. People are now developing personal monitoring systems to track air pollution levels on the personal level.

  3. Our analysis will use annual mean estimates; however, pollution levels can vary greatly by season, day, and even hour. There are data sources that have finer levels of temporal data, but we are interested in long-term exposures, as these appear to be the most influential for health outcomes, so we chose to use annual level data.

What are the data?

In Machine Learning for prediction, there are two main types of variables:

  1. Outcome variable
  2. Predictor variables

The outcome variable is what we are trying to predict. In building our model we actually have the outcome variable data, but we want to see how well our predictor variables can explain the variation in our outcome data. This gives us a sense of how well we can use the predictor variable data to predict our outcome variable levels when we do not have data about the outcome.

As a simpler example, imagine that we have data about the sales and characteristics of cars from last year and we want to predict which cars might sell well this year. We do not have the sales data yet for this year, but we do know the characteristics of our cars for this year. We can use a model of the characteristics that explained sales last year to estimate what cars might sell well this year. In this case, our outcome variable is the sale performance of the cars, while the different characteristics of the cars make up our predictor or explanatory variables.

In this case study, we will evaluate air pollution monitor data of fine particulate matter (PM2.5) in the contiguous US from 2008, as well as data about population density, road density, urbanization levels, and NASA satellite data to develop models to predict localized air pollution levels.

The monitor data will be our outcome variable. We want to determine if we can predict air pollution levels based on other types of data, like road density and population density to see if we can use these data to predict air pollution in areas where there are no monitors.

Our outcome variable

The monitor data that we will be using comes from gravimetric monitors operated by the US Environmental Protection Agency (EPA). These monitors use a filtration system to specifically capture fine particulate matter. The weight of this matter is manually measured daily or weekly. See here for the EPA standard operating procedure for PM gravimetric analysis in 2008.

Here is an image of what the gravimetric monitors look like:

Gravimetric analysis is also used for emission testing. The same idea applies: a fresh filter is applied and the desired amount of time passes, then the filter is removed and weighed.

There are other monitoring systems that can provide hourly measurements, but we will not be using data from these monitors in our analysis. Gravimetric analysis is considered to be among the most accurate methods.

In our CSV, the value column indicates the PM2.5 monitor average for 2008 in mass of fine particles per volume of air for 876 gravimetric monitors. The units are micrograms of fine particulate matter (PM) that is less than 2.5 micrometers in diameter per cubic meter of air - mass concentration (ug/m3). Recall that the WHO exposure guideline is < 10 ug/m3 on average annually for PM2.5.

Our predictor variables

There are 48 predictor variables with values for each of the 876 monitors included in our outcome variable. The data comes from the US Environmental Protection Agency (EPA), the National Aeronautics and Space Administration (NASA), the US Census, and the National Center for Health Statistics (NCHS).

The following table describes the variables:

Variable Details
id Monitor number
– the county number is indicated before the decimal
– the monitor number is indicated after the decimal
Example: 1073.0023 is Jefferson county (1073) and .0023 one of 8 monitors
fips Federal information processing standard number for the county where the monitor is located
– 5 digit id code for counties (zero is often the first value and sometimes is not shown)
– the first 2 numbers indicate the state
– the last three numbers indicate the county
Example: Alabama’s state code is 01 because it is first alphabetically
(note: Alaska and Hawaii are not included because they are not part of the contiguous US)
lat Latitude of the monitor in degrees
lon Longitude of the monitor in degrees
state State where the monitor is located
county County where the monitor is located
city City where the monitor is located
CMAQ Estimated values of air pollution from a computational model called Community Multiscale Air Quality (CMAQ)
– A modeling system that simulates the physics of the atmosphere using chemistry and weather data to predict the air pollution
– Does not use any of the PM2.5 gravimetric monitoring data. (There is a version that does use the gravimetric monitoring data, but not this one!)
– Data from the EPA
zcta Zip Code Tabulation Area where the monitor is located
– Postal Zip codes are converted into “generalized areal representations” that are non-overlapping
– Data from the 2010 Census
zcta_area Land area of the zip code area in meters squared
– Data from the 2010 Census
zcta_pop Population in the zip code area
– Data from the 2010 Census
imp_a500 Impervious surface measure
– Within a circle with a radius of 500 meters around the monitor
– Impervious surface are roads, concrete, parking lots, buildings
– This is a measure of development
imp_a1000 Impervious surface measure
– Within a circle with a radius of 1000 meters around the monitor
imp_a5000 Impervious surface measure
– Within a circle with a radius of 5000 meters around the monitor
imp_a10000 Impervious surface measure
– Within a circle with a radius of 10000 meters around the monitor
imp_a15000 Impervious surface measure
– Within a circle with a radius of 15000 meters around the monitor
county_area Land area of the county of the monitor in meters squared
county_pop Population of the county of the monitor
log_dist_to_prisec Log (Natural log) distance to a primary or secondary road from the monitor
– Highway or major road
log_pri_length_5000 Count of primary road length in meters in a circle with a radius of 5000 meters around the monitor (Natural log)
– Highways only
log_pri_length_10000 Count of primary road length in meters in a circle with a radius of 10000 meters around the monitor (Natural log)
– Highways only
log_pri_length_15000 Count of primary road length in meters in a circle with a radius of 15000 meters around the monitor (Natural log)
– Highways only
log_pri_length_25000 Count of primary road length in meters in a circle with a radius of 25000 meters around the monitor (Natural log)
– Highways only
log_prisec_length_500 Count of primary and secondary road length in meters in a circle with a radius of 500 meters around the monitor (Natural log)
– Highway and secondary roads
log_prisec_length_1000 Count of primary and secondary road length in meters in a circle with a radius of 1000 meters around the monitor (Natural log)
– Highway and secondary roads
log_prisec_length_5000 Count of primary and secondary road length in meters in a circle with a radius of 5000 meters around the monitor (Natural log)
– Highway and secondary roads
log_prisec_length_10000 Count of primary and secondary road length in meters in a circle with a radius of 10000 meters around the monitor (Natural log)
– Highway and secondary roads
log_prisec_length_15000 Count of primary and secondary road length in meters in a circle with a radius of 15000 meters around the monitor (Natural log)
– Highway and secondary roads
log_prisec_length_25000 Count of primary and secondary road length in meters in a circle with a radius of 25000 meters around the monitor (Natural log)
– Highway and secondary roads
log_nei_2008_pm25_sum_10000 Tons of emissions from major sources data base (annual data) sum of all sources within a circle with a radius of 10000 meters of distance around the monitor (Natural log)
log_nei_2008_pm25_sum_15000 Tons of emissions from major sources data base (annual data) sum of all sources within a circle with a radius of 15000 meters of distance around the monitor (Natural log)
log_nei_2008_pm25_sum_25000 Tons of emissions from major sources data base (annual data) sum of all sources within a circle with a radius of 25000 meters of distance around the monitor (Natural log)
log_nei_2008_pm10_sum_10000 Tons of emissions from major sources data base (annual data) sum of all sources within a circle with a radius of 10000 meters of distance around the monitor (Natural log)
log_nei_2008_pm10_sum_15000 Tons of emissions from major sources data base (annual data) sum of all sources within a circle with a radius of 15000 meters of distance around the monitor (Natural log)
log_nei_2008_pm10_sum_25000 Tons of emissions from major sources data base (annual data) sum of all sources within a circle with a radius of 25000 meters of distance around the monitor (Natural log)
popdens_county Population density (number of people per kilometer squared area of the county)
popdens_zcta Population density (number of people per kilometer squared area of zcta)
nohs Percentage of people in the zcta area where the monitor is located who do not have a high school degree
– Data from the Census
somehs Percentage of people in the zcta area where the monitor is located whose highest formal educational attainment was some high school education
– Data from the Census
hs Percentage of people in the zcta area where the monitor is located whose highest formal educational attainment was completing a high school degree
– Data from the Census
somecollege Percentage of people in the zcta area where the monitor is located whose highest formal educational attainment was completing some college education
– Data from the Census
associate Percentage of people in the zcta area where the monitor is located whose highest formal educational attainment was completing an associate degree
– Data from the Census
bachelor Percentage of people in the zcta area where the monitor is located whose highest formal educational attainment was a bachelor’s degree
– Data from the Census
grad Percentage of people in the zcta area where the monitor is located whose highest formal educational attainment was a graduate degree
– Data from the Census
pov Percentage of people in the zcta area where the monitor is located who lived in poverty in 2008 (possibly based on the 2007 HHS poverty guidelines: https://aspe.hhs.gov/2007-hhs-poverty-guidelines)
– Data from the Census
hs_orless Percentage of people in the zcta area where the monitor is located whose highest formal educational attainment was a high school degree or less (sum of nohs, somehs, and hs)
urc2013 2013 Urban-rural classification of the county where the monitor is located
– 6 category variable - 1 is totally urban 6 is completely rural
– Data from the National Center for Health Statistics (https://www.cdc.gov/nchs/index.htm)
urc2006 2006 Urban-rural classification of the county where the monitor is located
– 6 category variable - 1 is totally urban 6 is completely rural
– Data from the National Center for Health Statistics
aod Aerosol Optical Depth measurement from a NASA satellite
– based on the diffraction of a laser
– used as a proxy of particulate pollution
– unit-less - higher value indicates more pollution
– Data from NASA
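
The structure of the id and fips codes described in the table can be unpacked with a few lines of base R. This is only a sketch: the id is kept as text so that digits after the decimal point are preserved, and the variable names are hypothetical.

```r
# Monitor id from the table's example, stored as text.
id <- "1073.0023"

# Split on the literal decimal point: county number, then monitor number.
parts <- strsplit(id, ".", fixed = TRUE)[[1]]
county_number  <- parts[1]  # "1073"
monitor_number <- parts[2]  # "0023"

# FIPS code: restore the leading zero to get 5 digits,
# then take the state (first 2 digits) and county (last 3 digits).
fips <- 1073L
fips_padded <- sprintf("%05d", fips)      # "01073"
state_code  <- substr(fips_padded, 1, 2)  # "01" (Alabama)
county_code <- substr(fips_padded, 3, 5)  # "073"
```

Padding matters because the leading zero is dropped whenever the code is stored as a number.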

Many of these predictor variables have to do with the circular area around the monitor called the “buffer”. These are illustrated in the following figure:

Data Import

We have one CSV file that contains both our single outcome variable and all of our predictor variables.

Let’s import our data into R now so that we can explore the data further. We will call our data object pm for particulate matter.
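
The import step might look like the following sketch. The name and location of the real CSV file are not given here, so the example writes a two-row stand-in file (values copied from the first monitors in the data) to a temporary path and reads it back with readr::read_csv(). In practice the path would point at the actual file.

```r
# Stand-in CSV: the real file's path is not specified, so a temporary
# file with three columns and two rows is used for illustration.
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("id,value,state",
    "1003.001,9.597647,Alabama",
    "1027.0001,10.8,Alabama"),
  csv_path
)

# Read the CSV into a tibble called pm (for particulate matter).
pm <- readr::read_csv(csv_path)
pm
```

read_csv() guesses a type for each column and prints a message describing its guesses.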

Data Exploration and Wrangling

The first step in performing a machine learning analysis is to explore the data to better understand the variables included in the data, as we may learn about important details about the data that we should keep in mind as we try to predict our outcome variable.

First let’s just get a general sense of our data. We can do that using the glimpse() function of the dplyr package (it is also in the tibble package).

We will also use the %>% pipe which can be used to define the input for later sequential steps. This will make more sense when we have multiple sequential steps using the same data object. To use the pipe notation we need to install and load dplyr as well.

For example here we will first grab the pm data object, then we use the glimpse() function on it based on the pipe notation.
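
A minimal sketch of this step, assuming a three-row stand-in (pm_small, a hypothetical name) in place of the full pm object so that the example runs on its own. The output shown next in the text comes from the full data, not from this stand-in.

```r
library(dplyr)  # provides glimpse() and the %>% pipe

# Three-row stand-in for pm, with values taken from the first monitors.
pm_small <- tibble::tibble(
  id    = c(1003.001, 1027.0001, 1033.1002),
  value = c(9.597647, 10.800000, 11.212174),
  state = c("Alabama", "Alabama", "Alabama")
)

# Take pm_small, then pass it to glimpse().
pm_small %>%
  glimpse()
```

Written without the pipe, the same call is glimpse(pm_small).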

Rows: 876
Columns: 50
$ id                          <dbl> 1003.001, 1027.000, 1033.100, 1049.100, 1…
$ value                       <dbl> 9.597647, 10.800000, 11.212174, 11.659091…
$ fips                        <dbl> 1003, 1027, 1033, 1049, 1055, 1069, 1073,…
$ lat                         <dbl> 30.49800, 33.28126, 34.75878, 34.28763, 3…
$ lon                         <dbl> -87.88141, -85.80218, -87.65056, -85.9683…
$ state                       <chr> "Alabama", "Alabama", "Alabama", "Alabama…
$ county                      <chr> "Baldwin", "Clay", "Colbert", "DeKalb", "…
$ city                        <chr> "Fairhope", "Ashland", "Muscle Shoals", "…
$ CMAQ                        <dbl> 8.098836, 9.766208, 9.402679, 8.534772, 9…
$ zcta                        <dbl> 36532, 36251, 35660, 35962, 35901, 36303,…
$ zcta_area                   <dbl> 190980522, 374132430, 16716984, 203836235…
$ zcta_pop                    <dbl> 27829, 5103, 9042, 8300, 20045, 30217, 90…
$ imp_a500                    <dbl> 0.01730104, 1.96972318, 19.17301038, 5.78…
$ imp_a1000                   <dbl> 1.4096021, 0.8531574, 11.1448962, 3.86764…
$ imp_a5000                   <dbl> 3.3360118, 0.9851479, 15.1786154, 1.23114…
$ imp_a10000                  <dbl> 1.9879187, 0.5208189, 9.7253870, 1.031646…
$ imp_a15000                  <dbl> 1.4386207, 0.3359198, 5.2472094, 0.973044…
$ county_area                 <dbl> 4117521611, 1564252280, 1534877333, 20126…
$ county_pop                  <dbl> 182265, 13932, 54428, 71109, 104430, 1015…
$ log_dist_to_prisec          <dbl> 4.648181, 7.219907, 5.760131, 3.721489, 5…
$ log_pri_length_5000         <dbl> 8.517193, 8.517193, 8.517193, 8.517193, 9…
$ log_pri_length_10000        <dbl> 9.210340, 9.210340, 9.274303, 10.409411, …
$ log_pri_length_15000        <dbl> 9.630228, 9.615805, 9.658899, 11.173626, …
$ log_pri_length_25000        <dbl> 11.32735, 10.12663, 10.15769, 11.90959, 1…
$ log_prisec_length_500       <dbl> 7.295356, 6.214608, 8.611945, 7.310155, 8…
$ log_prisec_length_1000      <dbl> 8.195119, 7.600902, 9.735569, 8.585843, 9…
$ log_prisec_length_5000      <dbl> 10.815042, 10.170878, 11.770407, 10.21420…
$ log_prisec_length_10000     <dbl> 11.88680, 11.40554, 12.84066, 11.50894, 1…
$ log_prisec_length_15000     <dbl> 12.205723, 12.042963, 13.282656, 12.35366…
$ log_prisec_length_25000     <dbl> 13.41395, 12.79980, 13.79973, 13.55979, 1…
$ log_nei_2008_pm25_sum_10000 <dbl> 0.318035438, 3.218632928, 6.573127301, 0.…
$ log_nei_2008_pm25_sum_15000 <dbl> 1.967358961, 3.218632928, 6.581917457, 3.…
$ log_nei_2008_pm25_sum_25000 <dbl> 5.067308, 3.218633, 6.875900, 4.887665, 4…
$ log_nei_2008_pm10_sum_10000 <dbl> 1.35588511, 3.31111648, 6.69187313, 0.000…
$ log_nei_2008_pm10_sum_15000 <dbl> 2.26783411, 3.31111648, 6.70127741, 3.350…
$ log_nei_2008_pm10_sum_25000 <dbl> 5.628728, 3.311116, 7.148858, 5.171920, 4…
$ popdens_county              <dbl> 44.265706, 8.906492, 35.460814, 35.330814…
$ popdens_zcta                <dbl> 145.716431, 13.639555, 540.887040, 40.718…
$ nohs                        <dbl> 3.3, 11.6, 7.3, 14.3, 4.3, 5.8, 7.1, 2.7,…
$ somehs                      <dbl> 4.9, 19.1, 15.8, 16.7, 13.3, 11.6, 17.1, …
$ hs                          <dbl> 25.1, 33.9, 30.6, 35.0, 27.8, 29.8, 37.2,…
$ somecollege                 <dbl> 19.7, 18.8, 20.9, 14.9, 29.2, 21.4, 23.5,…
$ associate                   <dbl> 8.2, 8.0, 7.6, 5.5, 10.1, 7.9, 7.3, 8.0, …
$ bachelor                    <dbl> 25.3, 5.5, 12.7, 7.9, 10.0, 13.7, 5.9, 17…
$ grad                        <dbl> 13.5, 3.1, 5.1, 5.8, 5.4, 9.8, 2.0, 8.7, …
$ pov                         <dbl> 6.1, 19.5, 19.0, 13.8, 8.8, 15.6, 25.5, 7…
$ hs_orless                   <dbl> 33.3, 64.6, 53.7, 66.0, 45.4, 47.2, 61.4,…
$ urc2013                     <dbl> 4, 6, 4, 6, 4, 4, 1, 1, 1, 1, 1, 1, 1, 2,…
$ urc2006                     <dbl> 5, 6, 4, 5, 4, 4, 1, 1, 1, 1, 1, 1, 1, 2,…
$ aod                         <dbl> 37.36364, 34.81818, 36.00000, 33.08333, 4…

We can see that there are 876 monitors and that we have 50 total variables - one of which is the outcome. In this case our outcome variable is called value.
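
The glimpse() output also lets us check a few of the variable definitions by hand. The sketch below copies values for the first two monitors and recomputes popdens_zcta (people per square kilometer) and hs_orless (the sum of nohs, somehs, and hs). The data frame name check is hypothetical.

```r
# First two monitors, values copied from the glimpse() output.
check <- data.frame(
  zcta_area    = c(190980522, 374132430),  # square meters
  zcta_pop     = c(27829, 5103),
  popdens_zcta = c(145.716431, 13.639555),
  nohs         = c(3.3, 11.6),
  somehs       = c(4.9, 19.1),
  hs           = c(25.1, 33.9),
  hs_orless    = c(33.3, 64.6)
)

# Density: people per square kilometer (1 km^2 = 1e6 m^2).
dens <- check$zcta_pop / (check$zcta_area / 1e6)

# Education: sum of the three lowest attainment categories.
edu <- check$nohs + check$somehs + check$hs

# Compare with the stored columns, allowing for display rounding.
all.equal(dens, check$popdens_zcta, tolerance = 1e-6)
all.equal(edu, check$hs_orless)
```

Both comparisons agree for these two rows, which is consistent with the zcta area being stored in square meters and the density in people per square kilometer.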

Notice that some of the variables that we would think of as factors (categorical) are currently of class double as indicated by the <dbl> just to the right of the column names/variable names in the glimpse() output. Examples include the monitor ID (id), the Federal Information Processing Standard number for the county where the monitor was located (fips), and the Zip Code Tabulation Area (zcta).

Let’s convert these variables into factors. We can do this using the mutate_at() function of the dplyr package and the as.factor() base function.

In this case we are also using the assignment pipe (or double pipe) of the magrittr package, which looks like this: %<>%. This allows us to use the pm data as input and also reassign the output to the same data object name.
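
The conversion can be sketched as follows, again on a small stand-in (pm_small, a hypothetical name) rather than the full pm object. The output that follows in the text is from the full data.

```r
library(dplyr)
library(magrittr)  # provides the %<>% assignment pipe

# Four-column stand-in for pm, with values from the first three monitors.
pm_small <- tibble::tibble(
  id    = c(1003.001, 1027.0001, 1033.1002),
  fips  = c(1003, 1027, 1033),
  zcta  = c(36532, 36251, 35660),
  value = c(9.597647, 10.800000, 11.212174)
)

# Convert the three identifier columns to factors and overwrite pm_small.
pm_small %<>%
  mutate_at(vars(id, fips, zcta), as.factor)

glimpse(pm_small)
```

The %<>% line is shorthand for pm_small <- pm_small %>% mutate_at(...). The value column is left as a double because it is not listed in vars().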

Rows: 876
Columns: 50
$ id                          <fct> 1003.001, 1027.0001, 1033.1002, 1049.1003…
$ value                       <dbl> 9.597647, 10.800000, 11.212174, 11.659091…
$ fips                        <fct> 1003, 1027, 1033, 1049, 1055, 1069, 1073,…
$ lat                         <dbl> 30.49800, 33.28126, 34.75878, 34.28763, 3…
$ lon                         <dbl> -87.88141, -85.80218, -87.65056, -85.9683…
$ state                       <chr> "Alabama", "Alabama", "Alabama", "Alabama…
$ county                      <chr> "Baldwin", "Clay", "Colbert", "DeKalb", "…
$ city                        <chr> "Fairhope", "Ashland", "Muscle Shoals", "…
$ CMAQ                        <dbl> 8.098836, 9.766208, 9.402679, 8.534772, 9…
$ zcta                        <fct> 36532, 36251, 35660, 35962, 35901, 36303,…
$ zcta_area                   <dbl> 190980522, 374132430, 16716984, 203836235…
$ zcta_pop                    <dbl> 27829, 5103, 9042, 8300, 20045, 30217, 90…
$ imp_a500                    <dbl> 0.01730104, 1.96972318, 19.17301038, 5.78…
$ imp_a1000                   <dbl> 1.4096021, 0.8531574, 11.1448962, 3.86764…
$ imp_a5000                   <dbl> 3.3360118, 0.9851479, 15.1786154, 1.23114…
$ imp_a10000                  <dbl> 1.9879187, 0.5208189, 9.7253870, 1.031646…
$ imp_a15000                  <dbl> 1.4386207, 0.3359198, 5.2472094, 0.973044…
$ county_area                 <dbl> 4117521611, 1564252280, 1534877333, 20126…
$ county_pop                  <dbl> 182265, 13932, 54428, 71109, 104430, 1015…
$ log_dist_to_prisec          <dbl> 4.648181, 7.219907, 5.760131, 3.721489, 5…
$ log_pri_length_5000         <dbl> 8.517193, 8.517193, 8.517193, 8.517193, 9…
$ log_pri_length_10000        <dbl> 9.210340, 9.210340, 9.274303, 10.409411, …
$ log_pri_length_15000        <dbl> 9.630228, 9.615805, 9.658899, 11.173626, …
$ log_pri_length_25000        <dbl> 11.32735, 10.12663, 10.15769, 11.90959, 1…
$ log_prisec_length_500       <dbl> 7.295356, 6.214608, 8.611945, 7.310155, 8…
$ log_prisec_length_1000      <dbl> 8.195119, 7.600902, 9.735569, 8.585843, 9…
$ log_prisec_length_5000      <dbl> 10.815042, 10.170878, 11.770407, 10.21420…
$ log_prisec_length_10000     <dbl> 11.88680, 11.40554, 12.84066, 11.50894, 1…
$ log_prisec_length_15000     <dbl> 12.205723, 12.042963, 13.282656, 12.35366…
$ log_prisec_length_25000     <dbl> 13.41395, 12.79980, 13.79973, 13.55979, 1…
$ log_nei_2008_pm25_sum_10000 <dbl> 0.318035438, 3.218632928, 6.573127301, 0.…
$ log_nei_2008_pm25_sum_15000 <dbl> 1.967358961, 3.218632928, 6.581917457, 3.…
$ log_nei_2008_pm25_sum_25000 <dbl> 5.067308, 3.218633, 6.875900, 4.887665, 4…
$ log_nei_2008_pm10_sum_10000 <dbl> 1.35588511, 3.31111648, 6.69187313, 0.000…
$ log_nei_2008_pm10_sum_15000 <dbl> 2.26783411, 3.31111648, 6.70127741, 3.350…
$ log_nei_2008_pm10_sum_25000 <dbl> 5.628728, 3.311116, 7.148858, 5.171920, 4…
$ popdens_county              <dbl> 44.265706, 8.906492, 35.460814, 35.330814…
$ popdens_zcta                <dbl> 145.716431, 13.639555, 540.887040, 40.718…
$ nohs                        <dbl> 3.3, 11.6, 7.3, 14.3, 4.3, 5.8, 7.1, 2.7,…
$ somehs                      <dbl> 4.9, 19.1, 15.8, 16.7, 13.3, 11.6, 17.1, …
$ hs                          <dbl> 25.1, 33.9, 30.6, 35.0, 27.8, 29.8, 37.2,…
$ somecollege                 <dbl> 19.7, 18.8, 20.9, 14.9, 29.2, 21.4, 23.5,…
$ associate                   <dbl> 8.2, 8.0, 7.6, 5.5, 10.1, 7.9, 7.3, 8.0, …
$ bachelor                    <dbl> 25.3, 5.5, 12.7, 7.9, 10.0, 13.7, 5.9, 17…
$ grad                        <dbl> 13.5, 3.1, 5.1, 5.8, 5.4, 9.8, 2.0, 8.7, …
$ pov                         <dbl> 6.1, 19.5, 19.0, 13.8, 8.8, 15.6, 25.5, 7…
$ hs_orless                   <dbl> 33.3, 64.6, 53.7, 66.0, 45.4, 47.2, 61.4,…
$ urc2013                     <dbl> 4, 6, 4, 6, 4, 4, 1, 1, 1, 1, 1, 1, 1, 2,…
$ urc2006                     <dbl> 5, 6, 4, 5, 4, 4, 1, 1, 1, 1, 1, 1, 1, 2,…
$ aod                         <dbl> 37.36364, 34.81818, 36.00000, 33.08333, 4…

Great! We can see that these variables are now factors, as indicated by <fct> after the variable name.
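As a minimal sketch of this kind of conversion, assuming the dplyr package (version 1.0.0 or later, for across()) and a three-row stand-in for pm built from values shown in the output above (the object name toy_pm is hypothetical):

```r
library(dplyr)

# Hypothetical three-row stand-in for `pm`, using values shown in the output above
toy_pm <- tibble::tibble(
  id   = c(1003.001, 1027.0001, 1033.1002),
  fips = c(1003, 1027, 1033),
  zcta = c(36532, 36251, 35660),
  lat  = c(30.498, 33.281, 34.759)
)

# Convert the identifier-like columns to factors and leave `lat` numeric
toy_pm <- toy_pm %>%
  mutate(across(c(id, fips, zcta), as.factor))

# Each converted column should now be reported as <fct>
glimpse(toy_pm)
```

Treating identifiers as factors stops later steps from interpreting them as quantities.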

Packages to get a sense of the data

The skim() function of the skimr package is also really helpful for getting a general sense of your data.

Data summary
Name pm
Number of rows 876
Number of columns 50
_______________________
Column type frequency:
character 3
factor 3
numeric 44
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
state 0 1 4 20 0 49 0
county 0 1 3 20 0 471 0
city 0 1 4 48 0 607 0

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
id 0 1 FALSE 876 100: 1, 102: 1, 103: 1, 104: 1
fips 0 1 FALSE 569 170: 12, 603: 10, 261: 9, 107: 8
zcta 0 1 FALSE 842 475: 3, 110: 2, 160: 2, 290: 2

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
value 0 1 10.81 2.580000e+00 3.02 9.27 11.15 12.37 2.316000e+01 ▂▆▇▁▁
lat 0 1 38.48 4.620000e+00 25.47 35.03 39.30 41.66 4.840000e+01 ▁▃▅▇▂
lon 0 1 -91.74 1.496000e+01 -124.18 -99.16 -87.47 -80.69 -6.804000e+01 ▃▂▃▇▃
CMAQ 0 1 8.41 2.970000e+00 1.63 6.53 8.62 10.24 2.313000e+01 ▃▇▃▁▁
zcta_area 0 1 183173481.91 5.425989e+08 15459.00 14204601.75 37653560.50 160041508.25 8.164821e+09 ▇▁▁▁▁
zcta_pop 0 1 24227.58 1.777216e+04 0.00 9797.00 22014.00 35004.75 9.539700e+04 ▇▇▃▁▁
imp_a500 0 1 24.72 1.934000e+01 0.00 3.70 25.12 40.22 6.961000e+01 ▇▅▆▃▂
imp_a1000 0 1 24.26 1.802000e+01 0.00 5.32 24.53 38.59 6.750000e+01 ▇▅▆▃▁
imp_a5000 0 1 19.93 1.472000e+01 0.05 6.79 19.07 30.11 7.460000e+01 ▇▆▃▁▁
imp_a10000 0 1 15.82 1.381000e+01 0.09 4.54 12.36 24.17 7.209000e+01 ▇▃▂▁▁
imp_a15000 0 1 13.43 1.312000e+01 0.11 3.24 9.67 20.55 7.110000e+01 ▇▃▁▁▁
county_area 0 1 3768701992.12 6.212830e+09 33703512.00 1116536297.50 1690826566.50 2878192209.00 5.194723e+10 ▇▁▁▁▁
county_pop 0 1 687298.44 1.293489e+06 783.00 100948.00 280730.50 743159.00 9.818605e+06 ▇▁▁▁▁
log_dist_to_prisec 0 1 6.19 1.410000e+00 -1.46 5.43 6.36 7.15 1.045000e+01 ▁▁▃▇▁
log_pri_length_5000 0 1 9.82 1.080000e+00 8.52 8.52 10.05 10.73 1.205000e+01 ▇▂▆▅▂
log_pri_length_10000 0 1 10.92 1.130000e+00 9.21 9.80 11.17 11.83 1.302000e+01 ▇▂▇▇▃
log_pri_length_15000 0 1 11.50 1.150000e+00 9.62 10.87 11.72 12.40 1.359000e+01 ▆▂▇▇▃
log_pri_length_25000 0 1 12.24 1.100000e+00 10.13 11.69 12.46 13.05 1.436000e+01 ▅▃▇▇▃
log_prisec_length_500 0 1 6.99 9.500000e-01 6.21 6.21 6.21 7.82 9.400000e+00 ▇▁▂▂▁
log_prisec_length_1000 0 1 8.56 7.900000e-01 7.60 7.60 8.66 9.20 1.047000e+01 ▇▅▆▃▁
log_prisec_length_5000 0 1 11.28 7.800000e-01 8.52 10.91 11.42 11.83 1.278000e+01 ▁▁▃▇▃
log_prisec_length_10000 0 1 12.41 7.300000e-01 9.21 11.99 12.53 12.94 1.385000e+01 ▁▁▃▇▅
log_prisec_length_15000 0 1 13.03 7.200000e-01 9.62 12.59 13.13 13.57 1.441000e+01 ▁▁▃▇▅
log_prisec_length_25000 0 1 13.82 7.000000e-01 10.13 13.38 13.92 14.35 1.523000e+01 ▁▁▃▇▆
log_nei_2008_pm25_sum_10000 0 1 3.97 2.350000e+00 0.00 2.15 4.29 5.69 9.120000e+00 ▆▅▇▆▂
log_nei_2008_pm25_sum_15000 0 1 4.72 2.250000e+00 0.00 3.47 5.00 6.35 9.420000e+00 ▃▃▇▇▂
log_nei_2008_pm25_sum_25000 0 1 5.67 2.110000e+00 0.00 4.66 5.91 7.28 9.650000e+00 ▂▂▇▇▃
log_nei_2008_pm10_sum_10000 0 1 4.35 2.320000e+00 0.00 2.69 4.62 6.07 9.340000e+00 ▅▅▇▇▂
log_nei_2008_pm10_sum_15000 0 1 5.10 2.180000e+00 0.00 3.87 5.39 6.72 9.710000e+00 ▂▃▇▇▂
log_nei_2008_pm10_sum_25000 0 1 6.07 2.010000e+00 0.00 5.10 6.37 7.52 9.880000e+00 ▁▂▆▇▃
popdens_county 0 1 551.76 1.711510e+03 0.26 40.77 156.67 510.81 2.682191e+04 ▇▁▁▁▁
popdens_zcta 0 1 1279.66 2.757490e+03 0.00 101.15 610.35 1382.52 3.041884e+04 ▇▁▁▁▁
nohs 0 1 6.99 7.210000e+00 0.00 2.70 5.10 8.80 1.000000e+02 ▇▁▁▁▁
somehs 0 1 10.17 6.200000e+00 0.00 5.90 9.40 13.90 7.220000e+01 ▇▂▁▁▁
hs 0 1 30.32 1.140000e+01 0.00 23.80 30.75 36.10 1.000000e+02 ▂▇▂▁▁
somecollege 0 1 21.58 8.600000e+00 0.00 17.50 21.30 24.70 1.000000e+02 ▆▇▁▁▁
associate 0 1 7.13 4.010000e+00 0.00 4.90 7.10 8.80 7.140000e+01 ▇▁▁▁▁
bachelor 0 1 14.90 9.710000e+00 0.00 8.80 12.95 19.22 1.000000e+02 ▇▂▁▁▁
grad 0 1 8.91 8.650000e+00 0.00 3.90 6.70 11.00 1.000000e+02 ▇▁▁▁▁
pov 0 1 14.95 1.133000e+01 0.00 6.50 12.10 21.22 6.590000e+01 ▇▅▂▁▁
hs_orless 0 1 47.48 1.675000e+01 0.00 37.92 48.65 59.10 1.000000e+02 ▁▃▇▃▁
urc2013 0 1 2.92 1.520000e+00 1.00 2.00 3.00 4.00 6.000000e+00 ▇▅▃▂▁
urc2006 0 1 2.97 1.520000e+00 1.00 2.00 3.00 4.00 6.000000e+00 ▇▅▃▂▁
aod 0 1 43.70 1.956000e+01 5.00 31.66 40.17 49.67 1.430000e+02 ▃▇▁▁▁

Notice that there is a column called n_missing, which gives the number of values that are missing for each variable. It looks like our data is very complete and we do not have any missing data. This is also indicated by the complete_rate column, which shows the proportion of non-missing values. In our case all variables have a value of 1, indicating that they are fully complete.
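The completeness check described above can also be done in code. The following is a minimal sketch, assuming the skimr package and a small made-up data frame (toy_pm, with one NA inserted on purpose) in place of pm:

```r
library(skimr)

# Hypothetical stand-in for `pm`; values are illustrative and one NA is added on purpose
toy_pm <- data.frame(
  state = c("Alabama", "Alabama", "Arizona", "Arizona"),
  value = c(9.6, 10.8, NA, 11.7),
  CMAQ  = c(8.1, 9.8, 9.4, 8.5)
)

# skim() returns one row of summary statistics per column
toy_skim <- skim(toy_pm)

# Named vector of missing-value counts, one entry per variable
missing_counts <- setNames(toy_skim$n_missing, toy_skim$skim_variable)

# Variables with at least one missing value (an empty result would mean fully complete data)
missing_counts[missing_counts > 0]
```

On the real pm data this final line would be expected to return nothing, since every n_missing value in the summary is 0.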

The n_unique column shows us the number of unique values for each of our columns. We can see that there are 49 states represented in the data, and we know that the data should cover the contiguous states. Let’s take a look to see which states are included:

# A tibble: 49 x 1
   state               
   <chr>               
 1 Alabama             
 2 Arizona             
 3 Arkansas            
 4 California          
 5 Colorado            
 6 Connecticut         
 7 Delaware            
 8 District Of Columbia
 9 Florida             
10 Georgia             
11 Idaho               
12 Illinois            
13 Indiana             
14 Iowa                
15 Kansas              
16 Kentucky            
17 Louisiana           
18 Maine               
19 Maryland            
20 Massachusetts       
21 Michigan            
22 Minnesota           
23 Mississippi         
24 Missouri            
25 Montana             
26 Nebraska            
27 Nevada              
28 New Hampshire       
29 New Jersey          
30 New Mexico          
31 New York            
32 North Carolina      
33 North Dakota        
34 Ohio                
35 Oklahoma            
36 Oregon              
37 Pennsylvania        
38 Rhode Island        
39 South Carolina      
40 South Dakota        
41 Tennessee           
42 Texas               
43 Utah                
44 Vermont             
45 Virginia            
46 Washington          
47 West Virginia       
48 Wisconsin           
49 Wyoming             

It looks like “District of Columbia” is being included as a state, which is why we see 49 values rather than the 48 contiguous states. We can also see that Alaska and Hawaii are indeed not included in the data.
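A listing like the one above can be produced with the distinct() function of the dplyr package. This is a minimal sketch on a made-up stand-in (toy_pm is a hypothetical name, and the real pm data has 876 rows):

```r
library(dplyr)

# Hypothetical stand-in for the `state` column of `pm`
toy_pm <- tibble::tibble(
  state = c("Alabama", "Alabama", "Arizona", "District Of Columbia", "Arizona")
)

# One row per unique state name
states <- toy_pm %>% distinct(state)
states

# Confirm that Alaska and Hawaii do not appear
c("Alaska", "Hawaii") %in% states$state
```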

Here is another method of looking at the data, using the dfSummary() function of the summarytools package. We need to copy and paste the output into the R Markdown document.

Dimensions: 876 x 50
Duplicates: 0

No | Variable | Stats / Values | Freqs (% of Valid) | Valid | Missing
1 | id [factor] | 1. 1003.001; 2. 1027.0001; 3. 1033.1002; 4. 1049.1003; 5. 1055.001; 6. 1069.0003; 7. 1073.0023; 8. 1073.1005; 9. 1073.1009; 10. 1073.101; [ 866 others ] | 1 (0.1%); 1 (0.1%); 1 (0.1%); 1 (0.1%); 1 (0.1%); 1 (0.1%); 1 (0.1%); 1 (0.1%); 1 (0.1%); 1 (0.1%); 866 (98.9%) | 876 (100%) | 0 (0%)
2 | value [numeric] | Mean (sd) : 10.8 (2.6); min < med < max: 3 < 11.2 < 23.2; IQR (CV) : 3.1 (0.2) | 875 distinct values | 876 (100%) | 0 (0%)
3 | fips [factor] | 1. 1003; 2. 1027; 3. 1033; 4. 1049; 5. 1055; 6. 1069; 7. 1073; 8. 1089; 9. 1097; 10. 1101; [ 559 others ] | 1 (0.1%); 1 (0.1%); 1 (0.1%); 1 (0.1%); 1 (0.1%); 1 (0.1%); 8 (0.9%); 1 (0.1%); 2 (0.2%); 1 (0.1%); 858 (98.0%) | 876 (100%) | 0 (0%)
4 | lat [numeric] | Mean (sd) : 38.5 (4.6); min < med < max: 25.5 < 39.3 < 48.4; IQR (CV) : 6.6 (0.1) | 876 distinct values | 876 (100%) | 0 (0%)
5 | lon [numeric] | Mean (sd) : -91.7 (15); min < med < max: -124.2 < -87.5 < -68; IQR (CV) : 18.5 (-0.2) | 876 distinct values | 876 (100%) | 0 (0%)
6 | state [character] | 1. California; 2. Ohio; 3. Illinois; 4. Indiana; 5. North Carolina; 6. Pennsylvania; 7. Michigan; 8. Florida; 9. Georgia; 10. Texas; [ 39 others ] | 85 (9.7%); 44 (5.0%); 38 (4.3%); 36 (4.1%); 35 (4.0%); 32 (3.7%); 30 (3.4%); 29 (3.3%); 28 (3.2%); 27 (3.1%); 492 (56.2%) | 876 (100%) | 0 (0%)
7 | county [character] | 1. Jefferson; 2. Cook; 3. Hamilton; 4. Lake; 5. Los Angeles; 6. Wayne; 7. Washington; 8. Cuyahoga; 9. Jackson; 10. Madison; [ 461 others ] | 18 (2.1%); 12 (1.4%); 11 (1.3%); 11 (1.3%); 10 (1.1%); 10 (1.1%); 9 (1.0%); 7 (0.8%); 7 (0.8%); 7 (0.8%); 774 (88.4%) | 876 (100%) | 0 (0%)
8 | city [character] | 1. Not in a city; 2. New York; 3. Cleveland; 4. Baltimore; 5. Chicago; 6. Detroit; 7. Milwaukee; 8. New Haven; 9. Philadelphia; 10. Springfield; [ 597 others ] | 103 (11.8%); 9 (1.0%); 6 (0.7%); 5 (0.6%); 5 (0.6%); 5 (0.6%); 5 (0.6%); 5 (0.6%); 5 (0.6%); 5 (0.6%); 723 (82.5%) | 876 (100%) | 0 (0%)
9 | CMAQ [numeric] | Mean (sd) : 8.4 (3); min < med < max: 1.6 < 8.6 < 23.1; IQR (CV) : 3.7 (0.4) | 601 distinct values | 876 (100%) | 0 (0%)
10 | zcta [factor] | 1. 1022; 2. 1103; 3. 1201; 4. 1608; 5. 1832; 6. 1840; 7. 1863; 8. 1904; 9. 2113; 10. 2119; [ 832 others ] | 1 (0.1%); 2 (0.2%); 1 (0.1%); 2 (0.2%); 1 (0.1%); 1 (0.1%); 1 (0.1%); 1 (0.1%); 1 (0.1%); 1 (0.1%); 864 (98.6%) | 876 (100%) | 0 (0%)
11 | zcta_area [numeric] | Mean (sd) : 183173481.9 (542598878.5); min < med < max: 15459 < 37653560.5 < 8164820625; IQR (CV) : 145836906.5 (3) | 842 distinct values | 876 (100%) | 0 (0%)
12 | zcta_pop [numeric] | Mean (sd) : 24227.6 (17772.2); min < med < max: 0 < 22014 < 95397; IQR (CV) : 25207.8 (0.7) | 837 distinct values | 876 (100%) | 0 (0%)
13 | imp_a500 [numeric] | Mean (sd) : 24.7 (19.3); min < med < max: 0 < 25.1 < 69.6; IQR (CV) : 36.5 (0.8) | 816 distinct values | 876 (100%) | 0 (0%)
14 | imp_a1000 [numeric] | Mean (sd) : 24.3 (18); min < med < max: 0 < 24.5 < 67.5; IQR (CV) : 33.3 (0.7) | 860 distinct values | 876 (100%) | 0 (0%)
15 | imp_a5000 [numeric] | Mean (sd) : 19.9 (14.7); min < med < max: 0.1 < 19.1 < 74.6; IQR (CV) : 23.3 (0.7) | 870 distinct values | 876 (100%) | 0 (0%)
16 | imp_a10000 [numeric] | Mean (sd) : 15.8 (13.8); min < med < max: 0.1 < 12.4 < 72.1; IQR (CV) : 19.6 (0.9) | 870 distinct values | 876 (100%) | 0 (0%)
17 | imp_a15000 [numeric] | Mean (sd) : 13.4 (13.1); min < med < max: 0.1 < 9.7 < 71.1; IQR (CV) : 17.3 (1) | 870 distinct values | 876 (100%) | 0 (0%)
18 | county_area [numeric] | Mean (sd) : 3768701992.1 (6212829553.6); min < med < max: 33703512 < 1690826566.5 < 51947229509; IQR (CV) : 1761655911.5 (1.6) | 564 distinct values | 876 (100%) | 0 (0%)
19 | county_pop [numeric] | Mean (sd) : 687298.4 (1293488.7); min < med < max: 783 < 280730.5 < 9818605; IQR (CV) : 642211 (1.9) | 564 distinct values | 876 (100%) | 0 (0%)
20 | log_dist_to_prisec [numeric] | Mean (sd) : 6.2 (1.4); min < med < max: -1.5 < 6.4 < 10.5; IQR (CV) : 1.7 (0.2) | 870 distinct values | 876 (100%) | 0 (0%)
21 | log_pri_length_5000 [numeric] | Mean (sd) : 9.8 (1.1); min < med < max: 8.5 < 10.1 < 12; IQR (CV) : 2.2 (0.1) | 586 distinct values | 876 (100%) | 0 (0%)
22 | log_pri_length_10000 [numeric] | Mean (sd) : 10.9 (1.1); min < med < max: 9.2 < 11.2 < 13; IQR (CV) : 2 (0.1) | 687 distinct values | 876 (100%) | 0 (0%)
23 | log_pri_length_15000 [numeric] | Mean (sd) : 11.5 (1.1); min < med < max: 9.6 < 11.7 < 13.6; IQR (CV) : 1.5 (0.1) | 726 distinct values | 876 (100%) | 0 (0%)
24 | log_pri_length_25000 [numeric] | Mean (sd) : 12.2 (1.1); min < med < max: 10.1 < 12.5 < 14.4; IQR (CV) : 1.4 (0.1) | 787 distinct values | 876 (100%) | 0 (0%)
25 | log_prisec_length_500 [numeric] | Mean (sd) : 7 (1); min < med < max: 6.2 < 6.2 < 9.4; IQR (CV) : 1.6 (0.1) | 382 distinct values | 876 (100%) | 0 (0%)
26 | log_prisec_length_1000 [numeric] | Mean (sd) : 8.6 (0.8); min < med < max: 7.6 < 8.7 < 10.5; IQR (CV) : 1.6 (0.1) | 591 distinct values | 876 (100%) | 0 (0%)
27 | log_prisec_length_5000 [numeric] | Mean (sd) : 11.3 (0.8); min < med < max: 8.5 < 11.4 < 12.8; IQR (CV) : 0.9 (0.1) | 852 distinct values | 876 (100%) | 0 (0%)
28 | log_prisec_length_10000 [numeric] | Mean (sd) : 12.4 (0.7); min < med < max: 9.2 < 12.5 < 13.8; IQR (CV) : 1 (0.1) | 867 distinct values | 876 (100%) | 0 (0%)
29 | log_prisec_length_15000 [numeric] | Mean (sd) : 13 (0.7); min < med < max: 9.6 < 13.1 < 14.4; IQR (CV) : 1 (0.1) | 869 distinct values | 876 (100%) | 0 (0%)
30 | log_prisec_length_25000 [numeric] | Mean (sd) : 13.8 (0.7); min < med < max: 10.1 < 13.9 < 15.2; IQR (CV) : 1 (0.1) | 870 distinct values | 876 (100%) | 0 (0%)
31 | log_nei_2008_pm25_sum_10000 [numeric] | Mean (sd) : 4 (2.4); min < med < max: 0 < 4.3 < 9.1; IQR (CV) : 3.5 (0.6) | 828 distinct values | 876 (100%) | 0 (0%)
32 | log_nei_2008_pm25_sum_15000 [numeric] | Mean (sd) : 4.7 (2.2); min < med < max: 0 < 5 < 9.4; IQR (CV) : 2.9 (0.5) | 855 distinct values | 876 (100%) | 0 (0%)
33 | log_nei_2008_pm25_sum_25000 [numeric] | Mean (sd) : 5.7 (2.1); min < med < max: 0 < 5.9 < 9.7; IQR (CV) : 2.6 (0.4) | 860 distinct values | 876 (100%) | 0 (0%)
34 | log_nei_2008_pm10_sum_10000 [numeric] | Mean (sd) : 4.3 (2.3); min < med < max: 0 < 4.6 < 9.3; IQR (CV) : 3.4 (0.5) | 829 distinct values | 876 (100%) | 0 (0%)
35 | log_nei_2008_pm10_sum_15000 [numeric] | Mean (sd) : 5.1 (2.2); min < med < max: 0 < 5.4 < 9.7; IQR (CV) : 2.8 (0.4) | 855 distinct values | 876 (100%) | 0 (0%)
36 | log_nei_2008_pm10_sum_25000 [numeric] | Mean (sd) : 6.1 (2); min < med < max: 0 < 6.4 < 9.9; IQR (CV) : 2.4 (0.3) | 860 distinct values | 876 (100%) | 0 (0%)
37 | popdens_county [numeric] | Mean (sd) : 551.8 (1711.5); min < med < max: 0.3 < 156.7 < 26821.9; IQR (CV) : 470 (3.1) | 564 distinct values | 876 (100%) | 0 (0%)
38 | popdens_zcta [numeric] | Mean (sd) : 1279.7 (2757.5); min < med < max: 0 < 610.3 < 30418.8; IQR (CV) : 1281.4 (2.2) | 840 distinct values | 876 (100%) | 0 (0%)
39 | nohs [numeric] | Mean (sd) : 7 (7.2); min < med < max: 0 < 5.1 < 100; IQR (CV) : 6.1 (1) | 215 distinct values | 876 (100%) | 0 (0%)
40 | somehs [numeric] | Mean (sd) : 10.2 (6.2); min < med < max: 0 < 9.4 < 72.2; IQR (CV) : 8 (0.6) | 230 distinct values | 876 (100%) | 0 (0%)
41 | hs [numeric] | Mean (sd) : 30.3 (11.4); min < med < max: 0 < 30.8 < 100; IQR (CV) : 12.3 (0.4) | 347 distinct values | 876 (100%) | 0 (0%)
42 | somecollege [numeric] | Mean (sd) : 21.6 (8.6); min < med < max: 0 < 21.3 < 100; IQR (CV) : 7.2 (0.4) | 240 distinct values | 876 (100%) | 0 (0%)
43 | associate [numeric] | Mean (sd) : 7.1 (4); min < med < max: 0 < 7.1 < 71.4; IQR (CV) : 3.9 (0.6) | 157 distinct values | 876 (100%) | 0 (0%)
44 | bachelor [numeric] | Mean (sd) : 14.9 (9.7); min < med < max: 0 < 12.9 < 100; IQR (CV) : 10.4 (0.7) | 301 distinct values | 876 (100%) | 0 (0%)
45 | grad [numeric] | Mean (sd) : 8.9 (8.6); min < med < max: 0 < 6.7 < 100; IQR (CV) : 7.1 (1) | 245 distinct values | 876 (100%) | 0 (0%)
46 | pov [numeric] | Mean (sd) : 15 (11.3); min < med < max: 0 < 12.1 < 65.9; IQR (CV) : 14.7 (0.8) | 345 distinct values | 876 (100%) | 0 (0%)
47 | hs_orless [numeric] | Mean (sd) : 47.5 (16.8); min < med < max: 0 < 48.7 < 100; IQR (CV) : 21.2 (0.4) | 464 distinct values | 876 (100%) | 0 (0%)
48 | urc2013 [numeric] | Mean (sd) : 2.9 (1.5); min < med < max: 1 < 3 < 6; IQR (CV) : 2 (0.5) | 1 : 203 (23.2%); 2 : 163 (18.6%); 3 : 228 (26.0%); 4 : 123 (14.0%); 5 : 101 (11.5%); 6 : 58 (6.6%) | 876 (100%) | 0 (0%)
49 | urc2006 [numeric] | Mean (sd) : 3 (1.5); min < med < max: 1 < 3 < 6; IQR (CV) : 2 (0.5) | 1 : 195 (22.3%); 2 : 162 (18.5%); 3 : 221 (25.2%); 4 : 127 (14.5%); 5 : 115 (13.1%); 6 : 56 (6.4%) | 876 (100%) | 0 (0%)
50 | aod [numeric] | Mean (sd) : 43.7 (19.6); min < med < max: 5 < 40.2 < 143; IQR (CV) : 18 (0.4) | 581 distinct values | 876 (100%) | 0 (0%)

We can see that for many variables there are many low values as the distribution shows two peaks, one near zero and another with a higher value. This is true for the imp variables (measures of development), the nei variables (measures of emission sources) and the road density variables. We can also see that the range of some of the variables is very large, in particular the area and population related variables.

Evaluate correlation among possible predictors

In prediction analyses, it is also useful to evaluate if any of the variables are correlated.

Intuitively we can expect some of our variables to be correlated.

Let’s first take a look at all of our numeric variables with the corrplot package. The corrplot package is a convenient way to look at correlation among possible predictors, and it is a great option if we have many predictors. First we need to create a correlation matrix using the cor() function of the stats package (which is loaded automatically).
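The two steps just described (build a correlation matrix, then plot it) can be sketched as follows. This is only an illustration on simulated data: the toy_pm object, the seed, and the simulated values are assumptions, and only the column names are borrowed from pm.

```r
library(corrplot)

set.seed(123)  # arbitrary seed so the simulated toy data are reproducible
shared <- rnorm(100)

# Hypothetical stand-in for `pm`: two related columns, one unrelated column, one text column
toy_pm <- data.frame(
  state     = rep(c("Alabama", "Arizona"), each = 50),
  imp_a500  = shared + rnorm(100, sd = 0.3),
  imp_a1000 = shared + rnorm(100, sd = 0.3),
  aod       = rnorm(100)
)

# Keep the numeric columns only, then compute pairwise Pearson correlations
numeric_cols <- toy_pm[sapply(toy_pm, is.numeric)]
pm_cor <- cor(numeric_cols)

# Plot the matrix; tl.cex shrinks the variable labels, which helps with many predictors
corrplot(pm_cor, tl.cex = 0.8)
```

cor() only accepts numeric input, which is why the character column is dropped before the matrix is computed.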

Using ggcorplot package

We can see that the development (imp) variables are correlated with each other, as we might expect. We also see that the road density variables seem to be correlated with each other, and the emission variables seem to be correlated with each other. We can take a closer look using the ggcorr() function and the ggpairs() function of the GGally package. To select our variables of interest we can use the select() function of the dplyr package with the contains() selection helper.

First let’s look at the imp/development variables.

Indeed, we can see that imp_a1000 and imp_a500 are very strongly correlated, as are imp_a10000 and imp_a15000.
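A sketch of how the imp columns could be selected and plotted, assuming the dplyr and GGally packages and simulated stand-in data (toy_pm, the seed, and all values are hypothetical):

```r
library(dplyr)
library(GGally)

set.seed(123)  # arbitrary seed for the simulated toy data
shared <- runif(60, min = 0, max = 60)

# Hypothetical stand-in for `pm` with two related imp columns and one other column
toy_pm <- tibble::tibble(
  imp_a500     = shared + rnorm(60, sd = 5),
  imp_a1000    = shared + rnorm(60, sd = 5),
  popdens_zcta = runif(60, min = 0, max = 3000)
)

# Keep only the columns whose names contain "imp"
imp_vars <- toy_pm %>% select(contains("imp"))

# Correlation heat map with the coefficients printed in the cells
p_corr <- ggcorr(imp_vars, label = TRUE)
print(p_corr)

# Scatterplot matrix with densities and correlation values
p_pairs <- ggpairs(imp_vars)
print(p_pairs)
```

The same pattern would apply to the other variable groups by changing the string passed to contains().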

Now let’s take a look at the road density data:

We can see that many of the road density variables are highly correlated with one another, while others are less so.

Finally let’s look at the emission variables.

We might also expect the population density data to correlate with some of these variables. Let’s take a look.

Interestingly, these variables don’t appear to be highly correlated, so we might need variables from each of the categories to predict our monitor PM2.5 pollution values.

We seem to have some pretty extreme population values though, so let’s see what happens when we take the log value.

Indeed this increased the correlation, but variables from each of these categories may still prove to be useful for prediction.
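As a rough sketch of this log transformation, assuming dplyr 1.0.0 or later and a three-row stand-in built from rounded values in the output shown earlier (toy_pm and the log_ column prefix are hypothetical choices):

```r
library(dplyr)

# Hypothetical three-row stand-in for `pm`, using rounded values from the earlier output
toy_pm <- tibble::tibble(
  popdens_county              = c(44.27, 8.91, 35.46),
  popdens_zcta                = c(145.72, 13.64, 540.89),
  log_nei_2008_pm25_sum_10000 = c(0.318, 3.219, 6.573)
)

# Add log-transformed copies of the population density columns.
# Note: log(0) is -Inf, so zero densities (popdens_zcta has a minimum of 0 in pm)
# would need separate handling before computing correlations.
toy_log <- toy_pm %>%
  mutate(across(starts_with("popdens"), log, .names = "log_{.col}"))

# Correlations among the log-scale variables only
cor(toy_log %>% select(-starts_with("popdens")))
```

With only three rows the correlation values themselves are not meaningful; the point is the shape of the pipeline.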

Data Analysis

Now that we have a sense of what our data is like we can get started with data analysis.

The machine learning process

There are two major types of machine learning:

  1. Unsupervised
  2. Supervised

Unsupervised learning is used to learn about the structure of the data without knowing much about the data. We let the data reveal properties about itself. Examples include clustering the data into groups or reducing the dimensionality of the data using methods like principal component analysis (which we will describe in more detail later) to capture patterns of variance within the data.

In contrast, in supervised learning we have some knowledge about the data that we want to use to create a model to be able to generalize about other similar data.

There are two distinct goals of supervised machine learning:

  1. Prediction
  2. Classification

We will be performing a prediction analysis (which is also referred to as regression), which aims to predict continuous outcome variables given a number of predictors/explanatory variables/features/parameters, as we have already described.

Classification on the other hand aims to discern or predict group identity for a categorical outcome based on a number of predictors/explanatory variables/features/parameters.

The overall process is the same in either case and involves the following steps (which will each be explained in detail):

  1. Data exploration

We have already performed this step to get a sense of the data. It is important to know if we have NA values, to understand the class of each variable, and to determine if there are any redundant variables that might need to be removed.

  2. Data splitting

The data needs to be split into two pieces: a training set and a testing set. The training set will be used to optimize the model, while the testing set will be used to evaluate model performance.

  3. Variable assignment and preprocessing

Both the training and testing data need to be processed so that the data is compatible with the model and optimized for use with it. This involves assigning variables to specific roles within the model and preprocessing steps like scaling variables and removing redundant variables. This process is also called feature engineering.

  4. Model specification, fitting, tuning and performance evaluation using the training data

The model needs to first be fit to the training data. First the method or algorithm with which the model will be fit is specified (regression, random forest etc.). Then in both classification and prediction, the model is fit to the training data and the explanatory variables are used to estimate numeric values (in the case of prediction) or categorical values (in the case of classification) of the outcome variable of interest. If the model fits well, then these estimated values will be very similar to the true outcome variable values. If the model does not fit well, then these estimates will be more dissimilar from the true outcome variable values. In this case, aspects of the model may need to be modified to improve the similarity of the estimates to the true outcome values. One way to optimize model performance is a process called tuning, in which different model hyper-parameter options are tested to determine the best option for model performance.

  5. Overall model performance evaluation

Model performance is assessed as the similarity between the estimates of the outcome variable produced by the model and the true outcome variable values. This is typically done as an iterative process with the training data, alongside modification of the model, until the performance using the training data is satisfactory. At this point, the final model performance is assessed using the testing data. This then gives an estimate of how well the model will predict or classify the outcome variable of interest with new independent data. Ideally one would also perform an evaluation with independent data to provide a sense of how generalizable the model is to other data sources.

The tidymodels ecosystem

To perform our analysis we will be using the tidymodels suite of packages. You may be familiar with the older packages caret or mlr which are also for machine learning and modeling but are not a part of the tidyverse. Max Kuhn describes tidymodels like this:

“Other packages, such as caret and mlr, help to solve the R model API issue. These packages do a lot of other things too: preprocessing, model tuning, resampling, feature selection, ensembling, and so on. In the tidyverse, we strive to make our packages modular and parsnip is designed only to solve the interface issue. It is not designed to be a drop-in replacement for caret. The tidymodels package collection, which includes parsnip, has other packages for many of these tasks, and they are designed to work together. We are working towards higher-level APIs that can replicate and extend what the current model packages can do.”

There are many packages in the tidymodels ecosystem which assist with the various steps of the machine learning process:

This is a depiction of how these tools help perform the overall machine learning process:

The major benefits of tidymodels

  1. Standardized workflow/format/notation across different types of algorithms

Different notations are required for different algorithms as the algorithms have been developed by many different people. This would require the painstaking process of reformatting the data to be compatible with each algorithm if multiple algorithms were tested.

  2. Can easily modify preprocessing, algorithm choice, and hyper-parameter tuning, making optimization easy

Modifying a piece of the overall process is now much easier than before because many of the steps are specified using the tidymodels packages in a convenient manner. Thus the entire process can be rerun after a simple change to preprocessing without much difficulty.

Splitting the Data

The first step after data exploration in machine learning analysis is to split the data into training and testing datasets.

The training dataset will be used to build and tune our model. This is the data that the model “learns” on.

The testing set will be used to evaluate the performance of our model in a more generalizable way. What do we mean by “generalizable”?

Remember that our main goal is to use our model to predict air pollution levels in areas where there are no gravimetric monitors. If our model is very good at predicting air pollution for the data that we used to build it, it might still not do a good job in areas where there are few or no monitors. We would then see very good prediction accuracy on the training data and might assume that the model will estimate air pollution well whenever we use it, when in fact this would likely not be the case. This situation is what we call overfitting.

Overfitting happens when we end up modeling not only the major relationships in our data but also the noise within our data.

If we get fairly good prediction with our testing set then we will know that our model can be applied to other data and will perform fairly well. We will discuss this more later.

We will not touch the testing set until we have completed optimizing our model with the training set. This will allow us to have a less biased evaluation of how well our model can do with other data besides the data used in the training set to build the model. Ideally you would also want a completely independent dataset to further test the performance of your model.

Here is a great description of the differences between testing and training datasets.

We will use the rsample package to perform this step.

The initial_split() function allows us to specify how we want to split our data. Typically data is split into 3/4 for training and 1/4 for testing. This is the default proportion and does not need to be specified. However, you can change the proportion using the prop argument, which we will do here for illustrative purposes. You can also specify a variable to stratify by with the strata argument. This is useful if you have imbalanced categorical variables and you would like to make sure that the rarer categories are represented in similar proportions in both the testing and training sets. Otherwise the split is performed randomly.

The strata argument causes the random sampling to be conducted within the stratification variable. This can help ensure that the proportion of data points from each stratum in the training data is similar to that in the original data set.

In the case with our dataset, perhaps we would like our training set to have similar proportions of monitors from each of the states as in the initial data. This might be useful if we want our model to be generalizable across all of the states.
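To illustrate what stratifying by state would look like, here is a minimal sketch with the rsample package. It uses simulated data with three equally sized states (toy_pm, the seed, and the values are all hypothetical) and the default 3/4 proportion, so that the effect of strata is easy to see:

```r
library(rsample)

set.seed(1234)  # arbitrary seed so that the split is reproducible

# Hypothetical stand-in for `pm`: 40 simulated monitors in each of three states
toy_pm <- data.frame(
  state = rep(c("California", "Ohio", "Texas"), each = 40),
  value = rnorm(120, mean = 11, sd = 2.5)
)

# Sample 3/4 of the rows for training separately within each state
strat_split <- initial_split(toy_pm, prop = 3/4, strata = state)

# Each state contributes the same share of its rows to the training set
table(training(strat_split)$state)
table(testing(strat_split)$state)
```

With strata as small as many of the states in pm, rsample may pool or warn about small groups, which is consistent with the remark that the data is not really large enough to stratify by state.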

We can see that there are indeed different numbers of monitors in each state by using the count() function of the dplyr package.

# A tibble: 49 x 2
   state                    n
   <chr>                <int>
 1 Alabama                 24
 2 Arizona                 17
 3 Arkansas                16
 4 California              85
 5 Colorado                15
 6 Connecticut             14
 7 Delaware                 7
 8 District Of Columbia     3
 9 Florida                 29
10 Georgia                 28
11 Idaho                    7
12 Illinois                38
13 Indiana                 36
14 Iowa                    20
15 Kansas                  10
16 Kentucky                22
17 Louisiana               17
18 Maine                    1
19 Maryland                15
20 Massachusetts           16
21 Michigan                30
22 Minnesota               17
23 Mississippi             12
24 Missouri                13
25 Montana                 16
26 Nebraska                 7
27 Nevada                   4
28 New Hampshire            7
29 New Jersey              23
30 New Mexico              10
31 New York                24
32 North Carolina          35
33 North Dakota             4
34 Ohio                    44
35 Oklahoma                10
36 Oregon                  17
37 Pennsylvania            32
38 Rhode Island             5
39 South Carolina          14
40 South Dakota             9
41 Tennessee                3
42 Texas                   27
43 Utah                    14
44 Vermont                  4
45 Virginia                20
46 Washington               8
47 West Virginia           14
48 Wisconsin               21
49 Wyoming                 12

If our dataset were large enough it might be nice then to stratify by state, but our data is unfortunately not large enough. We will show how one would do this though for illustrative purposes. This option is often more important for classification applications of machine learning than it is for prediction.

Since the split is performed randomly, it is a good idea to use the base set.seed() function so that if you rerun your code, your split will be the same next time. We can see the number of monitors in our training, testing, and original data by typing in the name of our split object. The result will look like this: <training data sample number, testing data sample number, original sample number>

<Analysis/Assess/Total>
<584/292/876>
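A sketch of a seeded split that gives the same sizes as the output above, assuming the rsample package. The seed value and the toy_pm data frame (876 made-up rows) are assumptions, and prop = 2/3 is inferred from the 584/292 sizes rather than taken from the original code:

```r
library(rsample)

# Hypothetical stand-in for `pm` with the same number of rows (876)
toy_pm <- data.frame(
  id    = seq_len(876),
  value = seq(3, 23, length.out = 876)
)

# Setting the same seed before each call gives the same split
set.seed(1234)  # arbitrary seed
split_a <- initial_split(toy_pm, prop = 2/3)

set.seed(1234)
split_b <- initial_split(toy_pm, prop = 2/3)

# Printing the split object shows the training/testing/total counts
split_a

# TRUE when both splits selected exactly the same training rows
identical(training(split_a)$id, training(split_b)$id)
```

Without the set.seed() calls, the two splits would almost certainly select different rows.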

Importantly, the initial_split() function only determines which rows of our pm data frame should be assigned to training or testing; it does not actually split the data.

To extract the testing and training data we can use the training() and testing() functions also of the rsample package.

# A tibble: 48 x 2
   state                    n
   <chr>                <int>
 1 Alabama                 18
 2 Arizona                 12
 3 Arkansas                14
 4 California              54
 5 Colorado                12
 6 Connecticut              8
 7 Delaware                 6
 8 District Of Columbia     2
 9 Florida                 18
10 Georgia                 17
# … with 38 more rows
# A tibble: 48 x 2
   state                    n
   <chr>                <int>
 1 Alabama                  6
 2 Arizona                  5
 3 Arkansas                 2
 4 California              31
 5 Colorado                 3
 6 Connecticut              6
 7 Delaware                 1
 8 District Of Columbia     1
 9 Florida                 11
10 Georgia                 11
# … with 38 more rows
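The two per-state tables above come from counting states in the extracted training and testing sets. A minimal sketch of that extraction, assuming the rsample and dplyr packages and simulated stand-in data (the seed and the names toy_pm, train_pm and test_pm are hypothetical):

```r
library(rsample)
library(dplyr)

set.seed(1234)  # arbitrary seed

# Hypothetical stand-in for `pm` with an unequal number of monitors per state
toy_pm <- tibble::tibble(
  id    = seq_len(120),
  state = rep(c("California", "Ohio", "Texas"), times = c(60, 40, 20)),
  value = rnorm(120, mean = 11, sd = 2.5)
)

toy_split <- initial_split(toy_pm, prop = 2/3)

# training() and testing() return the actual data frames
train_pm <- training(toy_split)
test_pm  <- testing(toy_split)

# Number of monitors per state in each set
count(train_pm, state)
count(test_pm, state)
```

Every row of the original data ends up in exactly one of the two sets.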

Variable Role Assignment and Preprocessing

In tidymodels we will create a recipe, which is a standardized format for a sequence of steps for processing the data.

This can be very useful because it makes testing out different preprocessing steps or different algorithms with the same preprocessing very easy and reproducible.

Creating a recipe specifies how a data frame of predictors should be created - it specifies what variables to be used and the preprocessing steps but it does not execute these steps or create the data frame of predictors.

List the ingredients / specify the variables with the recipe() function

The first thing to do to create a recipe is to specify which variables we will be using as our outcome and predictors using the recipe() function. In terms of the metaphor of baking, we can think of this as listing our ingredients. The naming convention for recipe object names is *_rec or rec.

In our case, recall that our outcome is the value variable, which is the average annual gravimetric monitor PM2.5 concentration in ug/m3. Our predictors are all the other variables except the monitor ID, which is an id variable.

We do not include the monitor ID as a predictor because it encodes the county number together with a number designating which of the monitors in that county the values came from. This number is arbitrary, the county information is already given elsewhere in the data, and each monitor has only one value in the value variable, so nothing is gained by including this variable and it may instead introduce noise. However, it is useful to keep this variable in the data so that we can take a look at what is happening later. We will show you what to do in this case in just a bit.

The simplest recipe, with no preprocessing steps, would simply list the outcome and predictor variables.

We can do so in two ways:

  1. Using formula notation
  2. Assigning roles to each variable

Let’s look at the first way using formula notation, which looks like this:

outcome(s) ~ predictor(s)

In the case of multiple predictors, or a multivariate situation with two outcomes, use a plus sign:

outcome1 + outcome2 ~ predictor1 + predictor2

If we want to include all predictors we can use a period like so:

outcome_variable_name ~ .

Now with our data we will start by making a recipe for our training data. In the simplest case we might use all predictors like this:

Data Recipe

Inputs:

      role #variables
   outcome          1
 predictor         49
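A minimal sketch of the formula approach on a four-row toy data frame is shown here. The data frame toy_train is hypothetical (its values are rounded from the first training rows shown later in this section), and simple_rec is just an illustrative name that follows the *_rec convention.

```r
library(recipes)

# Hypothetical miniature version of the training data.
toy_train <- data.frame(
  id    = factor(c("1003.001", "1027.0001", "1033.1002", "1055.001")),
  value = c(9.6, 10.8, 11.2, 12.4),
  CMAQ  = c(8.1, 9.8, 9.4, 9.2),
  aod   = c(37.4, 34.8, 36.0, 43.4)
)

# value is the outcome; the period makes every other column a predictor.
simple_rec <- recipe(value ~ ., data = toy_train)

# Note that id is counted as a predictor at this point.
summary(simple_rec)
```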

However, to deal with the id variable we could use the update_role() function of the recipes package. This option works well with the newer workflows package. In analyses that do not use this newer package, id variables are often dropped instead, because they can make the process difficult when using the parsnip package alone: new levels (or possible values) may be introduced with the testing data.

Data Recipe

Inputs:

        role #variables
 id variable          1
     outcome          1
   predictor         48

We could also specify the outcome and predictors in the same way as the id variable. Please see here for examples of other roles for variables. The role can actually be any value.

The order is important here, as we first make all variables predictors and then override this role for the outcome and id variable. We will use the everything() function of the dplyr package to start with all of the variables in train_pm.

Data Recipe

Inputs:

        role #variables
 id variable          1
     outcome          1
   predictor         48
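The two ways of assigning roles described above can be sketched side by side on toy data. The data frame toy_train and the names rec_formula and rec_roles are hypothetical; the role labels mirror the ones in the printed output.

```r
library(recipes)
library(dplyr)

# Hypothetical miniature version of the training data.
toy_train <- data.frame(
  id    = factor(c("1003.001", "1027.0001", "1033.1002", "1055.001")),
  value = c(9.6, 10.8, 11.2, 12.4),
  CMAQ  = c(8.1, 9.8, 9.4, 9.2),
  aod   = c(37.4, 34.8, 36.0, 43.4)
)

# Option 1: formula first, then move id out of the predictors.
rec_formula <- recipe(value ~ ., data = toy_train) %>%
  update_role(id, new_role = "id variable")

# Option 2: no formula. Make everything a predictor first,
# then override the role of the outcome and the id variable.
rec_roles <- recipe(toy_train) %>%
  update_role(everything(), new_role = "predictor") %>%
  update_role(value, new_role = "outcome") %>%
  update_role(id, new_role = "id variable")

summary(rec_roles)
```

Both recipes should end up with the same role for every variable; reversing the order in option 2 would leave value and id as predictors.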

If we want to take a look at the formula from our recipe, we can use the formula() function of the stats package.

value ~ fips + lat + lon + state + county + city + CMAQ + zcta + 
    zcta_area + zcta_pop + imp_a500 + imp_a1000 + imp_a5000 + 
    imp_a10000 + imp_a15000 + county_area + county_pop + log_dist_to_prisec + 
    log_pri_length_5000 + log_pri_length_10000 + log_pri_length_15000 + 
    log_pri_length_25000 + log_prisec_length_500 + log_prisec_length_1000 + 
    log_prisec_length_5000 + log_prisec_length_10000 + log_prisec_length_15000 + 
    log_prisec_length_25000 + log_nei_2008_pm25_sum_10000 + log_nei_2008_pm25_sum_15000 + 
    log_nei_2008_pm25_sum_25000 + log_nei_2008_pm10_sum_10000 + 
    log_nei_2008_pm10_sum_15000 + log_nei_2008_pm10_sum_25000 + 
    popdens_county + popdens_zcta + nohs + somehs + hs + somecollege + 
    associate + bachelor + grad + pov + hs_orless + urc2013 + 
    urc2006 + aod
<environment: 0x7f89941e6120>

We can also view our recipe in more detail using the base summary() function.

# A tibble: 50 x 4
   variable type    role        source  
   <chr>    <chr>   <chr>       <chr>   
 1 id       nominal id variable original
 2 value    numeric outcome     original
 3 fips     nominal predictor   original
 4 lat      numeric predictor   original
 5 lon      numeric predictor   original
 6 state    nominal predictor   original
 7 county   nominal predictor   original
 8 city     nominal predictor   original
 9 CMAQ     numeric predictor   original
10 zcta     nominal predictor   original
# … with 40 more rows
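Both inspection calls can be tried on a toy recipe, as in the sketch below. The objects toy_train and toy_rec are hypothetical. Calling prep() before formula() is an assumption made here for compatibility, since newer releases of recipes appear to require a prepped recipe for formula(); the output above was produced without that step.

```r
library(recipes)
library(dplyr)

# Hypothetical miniature version of the training data.
toy_train <- data.frame(
  id    = factor(c("1003.001", "1027.0001", "1033.1002", "1055.001")),
  value = c(9.6, 10.8, 11.2, 12.4),
  CMAQ  = c(8.1, 9.8, 9.4, 9.2),
  aod   = c(37.4, 34.8, 36.0, 43.4)
)

toy_rec <- recipe(value ~ ., data = toy_train) %>%
  update_role(id, new_role = "id variable")

# One row per variable, with its type, role, and source.
toy_summary <- summary(toy_rec)
toy_summary

# Only the outcome and the predictors appear in the formula; id does not.
toy_formula <- formula(prep(toy_rec))
toy_formula
```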

List the preprocessing steps using the step functions of the recipes package

The other thing the recipes package allows for is specifying preprocessing steps using a variety of step_*() functions.

This link and this link show the many options for recipe step functions.

There are step functions for a variety of purposes:

  1. Imputation – which means filling in missing values based on the existing data
  2. Transformation – which means changing all values of a variable in the same way, typically to make it more normal or easier to interpret
  3. Discretization – which means converting continuous values into discrete or nominal values (binning, for example, to reduce the number of possible levels). However, this is generally not advisable!
  4. Encoding / Creating Dummy Variables – which means creating a numeric code for categorical variables (see More on Dummy Variables and one hot encoding)
  5. Data type conversions – which means changing from integer to factor or numeric to date etc.
  6. Interaction term addition to the model – which means that we would be modeling for predictors that would influence the capacity of each other to predict the outcome
  7. Normalization – which means centering and scaling the data to a similar range of values
  8. Dimensionality Reduction/ Signal Extraction – which means mathematically obtaining a new smaller set of variables that capture the variation or signal in the original variables (ex. Principal Component Analysis and Independent Component Analysis)
  9. Filtering – Filtering options for removing variables (ex. remove variables that are highly correlated to others or remove variables with very little variance and therefore likely little predictive capacity)
  10. Row operations – which means performing functions on the values within the rows (ex. rearranging, filtering, imputing)
  11. Checking functions – Sanity checks to look for missing values, to look at the variable classes etc.

All of the step functions look like step_* except for the check functions which look like check_*.

There are several ways to select what variables to apply steps to:

  1. tidyselect methods: contains(), matches(), starts_with(), ends_with(), everything(), num_range()
  2. based on the type: all_nominal(), all_numeric(), has_type()
  3. based on the role: all_predictors(), all_outcomes(), has_role()
  4. name - use the actual name of the variable/variables of interest

Let’s try adding some steps to our recipe.

We might consider log transforming our population and area variables (that aren’t densities) - let’s take a look at the range of these variables.

We can see that the range for each of these variables is quite large. We can log transform this data using the step_log() function of the recipes package.
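A minimal sketch of checking the ranges and log transforming with step_log() follows. The data frame toy_train is hypothetical (its population values are taken from the first training rows shown later in this section), and log_rec and log_train are illustrative names.

```r
library(recipes)
library(dplyr)

# Hypothetical miniature version of the training data.
toy_train <- data.frame(
  value      = c(9.6, 10.8, 11.2, 12.4),
  county_pop = c(182265, 13932, 54428, 104430),
  zcta_pop   = c(27829, 5103, 9042, 20045)
)

# Even in four rows the raw values span more than an order of magnitude.
range(toy_train$county_pop)
range(toy_train$zcta_pop)

# step_log() uses the natural log by default.
log_rec <- recipe(value ~ ., data = toy_train) %>%
  step_log(county_pop, zcta_pop)

# Estimate the step and pull out the transformed training data.
log_train <- log_rec %>%
  prep(retain = TRUE) %>%
  juice()

log_train
```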

We might also want to one hot encode some of our categorical variables so that they can be used with certain algorithms. We can do this with the step_dummy() function and the one_hot = TRUE argument. One hot encoding means that we don’t simply encode our categorical variables numerically, as our numeric assignments can be interpreted by algorithms as having a particular rank or order. Instead, binary variables made of 1s and 0s are used to arbitrarily assign a numeric value that has no apparent order.
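To make one hot encoding concrete, here is a small sketch on made-up data. The data frame toy_train and its values are hypothetical; the point is only the shape of the result, with one 0/1 column per level of state.

```r
library(recipes)
library(dplyr)

# Hypothetical toy data with a three-level categorical variable.
toy_train <- data.frame(
  value = c(9.6, 10.8, 11.2, 12.4, 10.1, 11.7),
  state = factor(c("Alabama", "Alabama", "Arizona",
                   "Arizona", "California", "California"))
)

# one_hot = TRUE keeps a column for every level instead of dropping one.
dummy_rec <- recipe(value ~ ., data = toy_train) %>%
  step_dummy(state, one_hot = TRUE)

dummy_train <- dummy_rec %>%
  prep(retain = TRUE) %>%
  juice()

# The state column is replaced by one indicator column per level.
names(dummy_train)
dummy_train
```

Each row has exactly one indicator set to 1, so no ordering between the states is implied.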

Our fips variable includes a numeric code for state and county - and therefore is essentially a proxy for county. Since we already have county, we will just use it and keep the fips id as another ID variable.

We can remove the fips variable from the predictors using update_role() to make sure that the role is no longer "predictor". We can make the role anything we want actually, so we will keep it something identifiable.

We might also want to remove variables that appear to be redundant and are highly correlated with others, as we know from our exploratory data analysis that many of our variables are correlated with one another. We can do this using the step_corr() function.

It is also a good idea to remove variables with near-zero variance, which can be done with the step_nzv() function. Variables have low variance if all the values are very similar, the values are very sparse, or if they are highly imbalanced.

Examples where you might have near-zero variance variables include:

  1. Similar Values - If the population density was nearly the same for every zcta that contained a monitor, then knowing the population density near our monitor would contribute little to our model in assisting us to predict monitor air pollution values.
  2. Sparse Data - If all of the monitors were in locations where almost no one in the population attended graduate school, then these values would mostly be zero. Again, this would do very little to help us distinguish our air pollution monitors. When many of the values are zero, this is also called sparse data.
  3. Imbalanced Data - If nearly all of the monitors were located in one particular state, and all the others only had one monitor each, then the real predictive value would simply be in knowing if a monitor is located in that particular state or not. In this case we don’t want to remove our variable, we just want to simplify it.
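The sparse case can be sketched with step_nzv() on toy data. The data frame toy_train and the column mostly_zero are hypothetical, and the step is used with its default cutoffs.

```r
library(recipes)
library(dplyr)

# Hypothetical toy data: one informative predictor and one sparse predictor.
toy_train <- data.frame(
  value        = seq(9, 12, length.out = 100),
  popdens_zcta = seq(10, 1000, length.out = 100),  # 100 distinct values
  mostly_zero  = c(rep(0, 99), 1)                  # 99 zeros and a single 1
)

nzv_rec <- recipe(value ~ ., data = toy_train) %>%
  step_nzv(all_predictors())

nzv_train <- nzv_rec %>%
  prep(retain = TRUE) %>%
  juice()

# mostly_zero is filtered out; popdens_zcta is kept.
names(nzv_train)
```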

See this blog post about why removing near-zero variance variables isn’t always a good idea if we think that a variable might be especially informative.

It is important to add the steps to the recipe in an order that makes sense just like with a cooking recipe.

Thus, first we are going to create numeric values for our categorical variables, then we will look at correlation and near-zero variance. We don’t want to remove some of our variables, like the CMAQ and aod variables, so we can make sure they are kept in the model by excluding them from those steps. If we specifically wanted to remove a predictor, we could use step_rm().

Data Recipe

Inputs:

        role #variables
   county id          1
 id variable          1
     outcome          1
   predictor         47

Operations:

Dummy variables from state, county, city, zcta
Correlation filter on all_predictors, -, CMAQ, -, aod
Sparse, unbalanced variable filter on all_predictors, -, CMAQ, -, aod
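A recipe with the same roles and the same three operations can be sketched on toy data as follows. The data frame toy_train, its values, and the name toy_rec are hypothetical; the toy data has only one categorical predictor (state), and imp_a1000 is deliberately built as a near copy of imp_a500 so that the correlation filter has something to remove.

```r
library(recipes)
library(dplyr)

set.seed(1234)
n <- 60

# Hypothetical toy data: two id-like columns, an outcome, and a few predictors.
toy_train <- data.frame(
  id       = factor(seq_len(n)),
  fips     = factor(rep(1001:1020, each = 3)),
  value    = rnorm(n, mean = 10, sd = 2),
  CMAQ     = rnorm(n, mean = 9, sd = 1),
  aod      = rnorm(n, mean = 40, sd = 5),
  state    = factor(rep(c("California", "Alabama", "Arizona"),
                        times = c(56, 2, 2))),
  imp_a500 = runif(n, min = 0, max = 20)
)
# Nearly a rescaled copy of imp_a500, so the two are highly correlated.
toy_train$imp_a1000 <- 0.8 * toy_train$imp_a500 + rnorm(n, sd = 0.1)

toy_rec <- recipe(toy_train) %>%
  update_role(everything(), new_role = "predictor") %>%
  update_role(value, new_role = "outcome") %>%
  update_role(id, new_role = "id variable") %>%
  update_role(fips, new_role = "county id") %>%
  # 1. encode first, so that the filters below see only numeric columns
  step_dummy(state, one_hot = TRUE) %>%
  # 2. drop highly correlated predictors, but never CMAQ or aod
  step_corr(all_predictors(), -CMAQ, -aod) %>%
  # 3. drop sparse or unbalanced predictors, but never CMAQ or aod
  step_nzv(all_predictors(), -CMAQ, -aod)

toy_rec
```

Nothing has been computed yet at this point; the recipe only records the roles and the order of the steps.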

Running the preprocessing

The next major function of the recipes package is prep().

This function updates the recipe object based on the training data. It estimates the parameters (the quantities and statistics required by the steps for the variables) for preprocessing and updates the model terms, as some of the predictors may be removed. This allows the recipe to be ready to use on other datasets. It doesn’t necessarily execute the preprocessing itself; however, we will specify an argument for it to do this so that we can take a look at the preprocessed data.

There are some important arguments to know about:

  1. training - you must supply a training data set to estimate parameters for preprocessing operations (recipe steps) - this may already be included in your recipe - as is the case for us
  2. fresh - if TRUE, will retrain and estimate parameters for any previous steps that were already prepped if you add more steps to the recipe
  3. verbose - if TRUE, shows the progress as the steps are evaluated and the size of the preprocessed training set
  4. retain - if TRUE, then the preprocessed training set will be saved within the recipe (as template). This is good if you are likely to add more steps and don’t want to rerun the prep() on the previous steps. However, this can make the recipe size large. This is necessary if you want to actually look at the preprocessed data.

oper 1 step dummy [training] 
oper 2 step corr [training] 
oper 3 step nzv [training] 
The retained training set is ~ 0.26 Mb  in memory.
[1] "var_info"       "term_info"      "steps"          "template"      
[5] "levels"         "retained"       "tr_info"        "orig_lvls"     
[9] "last_term_info"

There are also lots of useful things to check out in the output of prep(). You can see:

  1. the steps that were run (steps)
  2. the variable info (var_info)
  3. the model term info (term_info)
  4. the new levels of the variables (levels)
  5. the original levels of the variables (orig_lvls)
  6. info about the training data set size and completeness (tr_info)
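A minimal sketch of calling prep() with the arguments discussed above, and then looking inside the result, is given here. The objects toy_train, toy_rec, and prepped_rec are hypothetical, and the toy values are made up.

```r
library(recipes)
library(dplyr)

# Hypothetical toy data.
toy_train <- data.frame(
  value = c(9.6, 10.8, 11.2, 12.4, 10.1, 11.7),
  CMAQ  = c(8.1, 9.8, 9.4, 9.2, 8.9, 9.6),
  state = factor(c("Alabama", "Alabama", "Arizona",
                   "Arizona", "California", "California"))
)

toy_rec <- recipe(value ~ ., data = toy_train) %>%
  step_dummy(state, one_hot = TRUE) %>%
  step_nzv(all_predictors())

# verbose = TRUE reports each step as it is estimated;
# retain = TRUE stores the preprocessed training set in the recipe.
prepped_rec <- prep(toy_rec, training = toy_train,
                    verbose = TRUE, retain = TRUE)

# The components listed above are elements of the prepped recipe.
names(prepped_rec)
prepped_rec$var_info
prepped_rec$term_info
```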

Note: You may see the prep.recipe() function in material that you read about the recipes package. This is referring to the prep() function of the recipes package.

Extracting the preprocessed training data

Since we retained our preprocessed training data, we can take a look at it by using the juice() function of the recipes package like this:

Rows: 584
Columns: 36
$ id                          <fct> 1003.001, 1027.0001, 1033.1002, 1055.001,…
$ value                       <dbl> 9.597647, 10.800000, 11.212174, 12.375394…
$ fips                        <fct> 1003, 1027, 1033, 1055, 1069, 1073, 1073,…
$ lat                         <dbl> 30.49800, 33.28126, 34.75878, 33.99375, 3…
$ lon                         <dbl> -87.88141, -85.80218, -87.65056, -85.9910…
$ CMAQ                        <dbl> 8.098836, 9.766208, 9.402679, 9.241744, 9…
$ zcta_area                   <dbl> 190980522, 374132430, 16716984, 154069359…
$ zcta_pop                    <dbl> 27829, 5103, 9042, 20045, 30217, 9010, 16…
$ imp_a500                    <dbl> 0.01730104, 1.96972318, 19.17301038, 16.4…
$ imp_a15000                  <dbl> 1.4386207, 0.3359198, 5.2472094, 5.161210…
$ county_area                 <dbl> 4117521611, 1564252280, 1534877333, 13856…
$ county_pop                  <dbl> 182265, 13932, 54428, 104430, 101547, 658…
$ log_dist_to_prisec          <dbl> 4.648181, 7.219907, 5.760131, 5.261457, 7…
$ log_pri_length_5000         <dbl> 8.517193, 8.517193, 8.517193, 9.066563, 8…
$ log_pri_length_25000        <dbl> 11.32735, 10.12663, 10.15769, 12.01356, 1…
$ log_prisec_length_500       <dbl> 7.295356, 6.214608, 8.611945, 8.740680, 6…
$ log_prisec_length_1000      <dbl> 8.195119, 7.600902, 9.735569, 9.627898, 7…
$ log_prisec_length_5000      <dbl> 10.815042, 10.170878, 11.770407, 11.72888…
$ log_prisec_length_10000     <dbl> 11.886803, 11.405543, 12.840663, 12.76827…
$ log_nei_2008_pm10_sum_15000 <dbl> 2.26783411, 3.31111648, 6.70127741, 4.462…
$ log_nei_2008_pm10_sum_25000 <dbl> 5.628728, 3.311116, 7.148858, 4.678311, 3…
$ popdens_county              <dbl> 44.265706, 8.906492, 35.460814, 75.367038…
$ popdens_zcta                <dbl> 145.7164307, 13.6395554, 540.8870404, 130…
$ nohs                        <dbl> 3.3, 11.6, 7.3, 4.3, 5.8, 7.1, 2.7, 11.1,…
$ somehs                      <dbl> 4.9, 19.1, 15.8, 13.3, 11.6, 17.1, 6.6, 1…
$ hs                          <dbl> 25.1, 33.9, 30.6, 27.8, 29.8, 37.2, 30.7,…
$ somecollege                 <dbl> 19.7, 18.8, 20.9, 29.2, 21.4, 23.5, 25.7,…
$ associate                   <dbl> 8.2, 8.0, 7.6, 10.1, 7.9, 7.3, 8.0, 4.1, …
$ bachelor                    <dbl> 25.3, 5.5, 12.7, 10.0, 13.7, 5.9, 17.6, 7…
$ grad                        <dbl> 13.5, 3.1, 5.1, 5.4, 9.8, 2.0, 8.7, 2.9, …
$ pov                         <dbl> 6.1, 19.5, 19.0, 8.8, 15.6, 25.5, 7.3, 8.…
$ hs_orless                   <dbl> 33.3, 64.6, 53.7, 45.4, 47.2, 61.4, 40.0,…
$ urc2013                     <dbl> 4, 6, 4, 4, 4, 1, 1, 1, 1, 1, 2, 3, 3, 3,…
$ aod                         <dbl> 37.363636, 34.818182, 36.000000, 43.41666…
$ state_California            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ city_Not.in.a.city          <dbl> 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0,…

For easy comparison, here is our original data:

Rows: 876
Columns: 50
$ id                          <fct> 1003.001, 1027.0001, 1033.1002, 1049.1003…
$ value                       <dbl> 9.597647, 10.800000, 11.212174, 11.659091…
$ fips                        <fct> 1003, 1027, 1033, 1049, 1055, 1069, 1073,…
$ lat                         <dbl> 30.49800, 33.28126, 34.75878, 34.28763, 3…
$ lon                         <dbl> -87.88141, -85.80218, -87.65056, -85.9683…
$ state                       <chr> "Alabama", "Alabama", "Alabama", "Alabama…
$ county                      <chr> "Baldwin", "Clay", "Colbert", "DeKalb", "…
$ city                        <chr> "Fairhope", "Ashland", "Muscle Shoals", "…
$ CMAQ                        <dbl> 8.098836, 9.766208, 9.402679, 8.534772, 9…
$ zcta                        <fct> 36532, 36251, 35660, 35962, 35901, 36303,…
$ zcta_area                   <dbl> 190980522, 374132430, 16716984, 203836235…
$ zcta_pop                    <dbl> 27829, 5103, 9042, 8300, 20045, 30217, 90…
$ imp_a500                    <dbl> 0.01730104, 1.96972318, 19.17301038, 5.78…
$ imp_a1000                   <dbl> 1.4096021, 0.8531574, 11.1448962, 3.86764…
$ imp_a5000                   <dbl> 3.3360118, 0.9851479, 15.1786154, 1.23114…
$ imp_a10000                  <dbl> 1.9879187, 0.5208189, 9.7253870, 1.031646…
$ imp_a15000                  <dbl> 1.4386207, 0.3359198, 5.2472094, 0.973044…
$ county_area                 <dbl> 4117521611, 1564252280, 1534877333, 20126…
$ county_pop                  <dbl> 182265, 13932, 54428, 71109, 104430, 1015…
$ log_dist_to_prisec          <dbl> 4.648181, 7.219907, 5.760131, 3.721489, 5…
$ log_pri_length_5000         <dbl> 8.517193, 8.517193, 8.517193, 8.517193, 9…
$ log_pri_length_10000        <dbl> 9.210340, 9.210340, 9.274303, 10.409411, …
$ log_pri_length_15000        <dbl> 9.630228, 9.615805, 9.658899, 11.173626, …
$ log_pri_length_25000        <dbl> 11.32735, 10.12663, 10.15769, 11.90959, 1…
$ log_prisec_length_500       <dbl> 7.295356, 6.214608, 8.611945, 7.310155, 8…
$ log_prisec_length_1000      <dbl> 8.195119, 7.600902, 9.735569, 8.585843, 9…
$ log_prisec_length_5000      <dbl> 10.815042, 10.170878, 11.770407, 10.21420…
$ log_prisec_length_10000     <dbl> 11.88680, 11.40554, 12.84066, 11.50894, 1…
$ log_prisec_length_15000     <dbl> 12.205723, 12.042963, 13.282656, 12.35366…
$ log_prisec_length_25000     <dbl> 13.41395, 12.79980, 13.79973, 13.55979, 1…
$ log_nei_2008_pm25_sum_10000 <dbl> 0.318035438, 3.218632928, 6.573127301, 0.…
$ log_nei_2008_pm25_sum_15000 <dbl> 1.967358961, 3.218632928, 6.581917457, 3.…
$ log_nei_2008_pm25_sum_25000 <dbl> 5.067308, 3.218633, 6.875900, 4.887665, 4…
$ log_nei_2008_pm10_sum_10000 <dbl> 1.35588511, 3.31111648, 6.69187313, 0.000…
$ log_nei_2008_pm10_sum_15000 <dbl> 2.26783411, 3.31111648, 6.70127741, 3.350…
$ log_nei_2008_pm10_sum_25000 <dbl> 5.628728, 3.311116, 7.148858, 5.171920, 4…
$ popdens_county              <dbl> 44.265706, 8.906492, 35.460814, 35.330814…
$ popdens_zcta                <dbl> 145.716431, 13.639555, 540.887040, 40.718…
$ nohs                        <dbl> 3.3, 11.6, 7.3, 14.3, 4.3, 5.8, 7.1, 2.7,…
$ somehs                      <dbl> 4.9, 19.1, 15.8, 16.7, 13.3, 11.6, 17.1, …
$ hs                          <dbl> 25.1, 33.9, 30.6, 35.0, 27.8, 29.8, 37.2,…
$ somecollege                 <dbl> 19.7, 18.8, 20.9, 14.9, 29.2, 21.4, 23.5,…
$ associate                   <dbl> 8.2, 8.0, 7.6, 5.5, 10.1, 7.9, 7.3, 8.0, …
$ bachelor                    <dbl> 25.3, 5.5, 12.7, 7.9, 10.0, 13.7, 5.9, 17…
$ grad                        <dbl> 13.5, 3.1, 5.1, 5.8, 5.4, 9.8, 2.0, 8.7, …
$ pov                         <dbl> 6.1, 19.5, 19.0, 13.8, 8.8, 15.6, 25.5, 7…
$ hs_orless                   <dbl> 33.3, 64.6, 53.7, 66.0, 45.4, 47.2, 61.4,…
$ urc2013                     <dbl> 4, 6, 4, 6, 4, 4, 1, 1, 1, 1, 1, 1, 1, 2,…
$ urc2006                     <dbl> 5, 6, 4, 5, 4, 4, 1, 1, 1, 1, 1, 1, 1, 2,…
$ aod                         <dbl> 37.36364, 34.81818, 36.00000, 33.08333, 4…

Notice how we only have 36 variables now instead of 50! Two of these are our ID variables (fips and the actual monitor ID (id)) and one is our outcome (value). Thus we only have 33 predictors now. We can also see that we no longer have any categorical variables. Variables like state are gone and only state_California remains, as it was the only state indicator to survive the correlation and near-zero variance filters. We can see that California had the largest number of monitors compared to the other states. We can also see that there were more monitors listed as "Not in a city" than any city.

# A tibble: 49 x 2
   state                    n
   <chr>                <int>
 1 Alabama                 24
 2 Arizona                 17
 3 Arkansas                16
 4 California              85
 5 Colorado                15
 6 Connecticut             14
 7 Delaware                 7
 8 District Of Columbia     3
 9 Florida                 29
10 Georgia                 28
11 Idaho                    7
12 Illinois                38
13 Indiana                 36
14 Iowa                    20
15 Kansas                  10
16 Kentucky                22
17 Louisiana               17
18 Maine                    1
19 Maryland                15
20 Massachusetts           16
21 Michigan                30
22 Minnesota               17
23 Mississippi             12
24 Missouri                13
25 Montana                 16
26 Nebraska                 7
27 Nevada                   4
28 New Hampshire            7
29 New Jersey              23
30 New Mexico              10
31 New York                24
32 North Carolina          35
33 North Dakota             4
34 Ohio                    44
35 Oklahoma                10
36 Oregon                  17
37 Pennsylvania            32
38 Rhode Island             5
39 South Carolina          14
40 South Dakota             9
41 Tennessee                3
42 Texas                   27
43 Utah                    14
44 Vermont                  4
45 Virginia                20
46 Washington               8
47 West Virginia           14
48 Wisconsin               21
49 Wyoming                 12

# A tibble: 607 x 2
    city                                                 n
    <chr>                                            <int>
  1 Aberdeen                                             1
  2 Akron                                                2
  3 Albany                                               3
  4 Albuquerque                                          2
  5 Alexandria                                           1
  6 Allen Park                                           1
  7 Altamont                                             1
  8 Alton                                                1
  9 Amarillo                                             1
 10 Anadarko                                             1
 11 Anaheim                                              1
 12 Anderson                                             1
 13 Annandale                                            1
 14 Apache Junction                                      1
 15 Apple Valley                                         1
 16 Appleton                                             1
 17 Arden-Arcade                                         1
 18 Arlington                                            1
 19 Arnold                                               1
 20 Asheville                                            1
 21 Ashland                                              2
 22 Atascadero                                           1
 23 Athens-Clarke County (Remainder)                     1
 24 Atlanta                                              2
 25 Atlantic City                                        1
 26 Augusta-Richmond County (Remainder)                  2
 27 Aurora                                               1
 28 Austin                                               1
 29 Azusa                                                1
 30 Bakersfield                                          3
 31 Baltimore                                            5
 32 Batavia                                              1
 33 Baton Rouge                                          1
 34 Bay City                                             1
 35 Bayport                                              1
 36 Baytown                                              1
 37 Beaver Falls                                         1
 38 Beckley                                              1
 39 Belle Glade                                          1
 40 Bellevue                                             1
 41 Beltsville                                           1
 42 Bend                                                 1
 43 Bennington                                           1
 44 Bensley                                              1
 45 Big Bear City                                        1
 46 Billings                                             1
 47 Birmingham                                           2
 48 Bismarck                                             1
 49 Bladensburg                                          1
 50 Blair                                                1
 51 Blue Ash                                             1
 52 Blue Island                                          1
 53 Boise (corporate name Boise City)                    1
 54 Boone                                                1
 55 Boston                                               4
 56 Boulder                                              1
 57 Boulevard                                            1
 58 Bountiful                                            1
 59 Braidwood                                            1
 60 Brawley                                              1
 61 Bridgeport                                           1
 62 Brigham City                                         1
 63 Bristol                                              2
 64 Brockton                                             1
 65 Brook Park                                           1
 66 Brookings                                            1
 67 Brunswick                                            1
 68 Bryson City (RR name Bryson)                         1
 69 Buffalo                                              1
 70 Burbank                                              1
 71 Burlington                                           2
 72 Burns                                                1
 73 Butte-Silver Bow (Remainder)                         1
 74 Calexico                                             1
 75 Camden                                               1
 76 Candor                                               1
 77 Canton                                               2
 78 Carlisle                                             1
 79 Carlstadt                                            3
 80 Cary                                                 1
 81 Casa Grande                                          1
 82 Cedar Rapids                                         2
 83 Cedarhurst                                           1
 84 Central Point                                        1
 85 Chalmette                                            1
 86 Champaign                                            1
 87 Chapel Hill                                          1
 88 Charleroi                                            1
 89 Charleston                                           2
 90 Charlotte                                            2
 91 Chattanooga                                          1
 92 Chelmsford (Chelmsford Center)                       1
 93 Chester                                              2
 94 Cheyenne                                             1
 95 Chicago                                              5
 96 Chickasaw                                            1
 97 Chicopee                                             1
 98 Childersburg                                         1
 99 Chula Vista                                          1
100 Cicero                                               1
101 Cincinnati                                           3
102 Clairton                                             1
103 Claremont                                            1
104 Clarion                                              1
105 Clarksburg                                           1
106 Clearwater                                           1
107 Cleveland                                            6
108 Clinton                                              2
109 Clive                                                1
110 Clovis                                               1
111 Cockeysville                                         1
112 Cody                                                 1
113 Coloma                                               1
114 Colorado Springs                                     1
115 Columbia                                             1
116 Columbia Falls                                       1
117 Columbus                                             4
118 Columbus (Remainder)                                 3
119 Colusa                                               1
120 Commerce City                                        1
121 Concord                                              1
122 Conway                                               1
123 Corcoran                                             1
124 Cornwall                                             1
125 Corpus Christi                                       2
126 Cottage Grove                                        1
127 Cottonwood West                                      1
128 Council Bluffs                                       1
129 Covington                                            1
130 Crossett                                             1
131 Crossville                                           1
132 Dale                                                 1
133 Dallas                                               3
134 Danbury                                              1
135 Darrington                                           1
136 Davenport                                            3
137 Davie                                                1
138 Dayton                                               1
139 Dearborn                                             1
140 Decatur                                              2
141 Delray Beach                                         1
142 Dentsville (Dents)                                   1
143 Denver                                               3
144 Des Moines                                           1
145 Des Plaines                                          1
146 Detroit                                              5
147 Doraville                                            1
148 Dothan                                               1
149 Douglas                                              1
150 Dover                                                1
151 Duluth                                               2
152 Durham                                               1
153 East Chicago                                         1
154 East Farmingdale                                     1
155 East Hartford                                        2
156 East Highland Park                                   1
157 East Providence                                      1
158 East Ridge                                           1
159 East Saint Louis                                     1
160 East Syracuse                                        1
161 Edgewood                                             1
162 El Cajon                                             1
163 El Centro                                            1
164 El Dorado                                            1
165 El Paso                                              3
166 Elgin                                                1
167 Elizabeth                                            2
168 Elizabethtown                                        1
169 Elkhart                                              1
170 Emmetsburg                                           1
171 Erie                                                 1
172 Escondido                                            1
173 Essex                                                1
174 Eugene                                               2
175 Eureka                                               2
176 Evansville                                           3
177 Fairfield                                            1
178 Fairhope                                             1
179 Fairmont                                             1
180 Fall River                                           1
181 Farmington                                           1
182 Farrell                                              1
183 Fayetteville                                         1
184 Ferry Pass                                           1
185 Flagstaff                                            1
186 Flint                                                1
187 Follansbee                                           1
188 Fontana                                              1
189 Forest Park                                          1
190 Fort Collins                                         1
191 Fort Defiance                                        1
192 Fort Lee                                             1
193 Fort Myers                                           1
194 Fort Pierce                                          1
195 Fort Smith                                           1
196 Fort Wayne                                           1
197 Fort Worth                                           2
198 Frankfort                                            1
199 Franklin                                             1
200 Freemansburg                                         1
201 Fremont                                              1
202 Fresno                                               2
203 Gadsden                                              1
204 Gainesville                                          3
205 Galloway (Township of)                               1
206 Garden City                                          1
207 Gary                                                 3
208 Gastonia                                             1
209 Gilroy                                               1
210 Glen Burnie                                          1
211 Goldsboro                                            1
212 Gordon                                               1
213 Grand Island                                         1
214 Grand Junction                                       1
215 Grand Rapids                                         2
216 Granite City                                         2
217 Grants Pass                                          1
218 Grass Valley                                         1
219 Great Falls                                          1
220 Greater Upper Marlboro                               1
221 Greeley                                              1
222 Green Bay                                            1
223 Greensboro                                           1
224 Greensburg                                           1
225 Greenville                                           3
226 Greenwich (Township of)                              1
227 Grenada                                              1
228 Griffith                                             1
229 Groveton                                             1
230 Gulfport                                             1
231 Hamilton                                             1
232 Hammond                                              3
233 Hampton                                              1
234 Harrison Township                                    1
235 Harrisville                                          1
236 Hattiesburg                                          1
237 Haverhill                                            1
238 Helena                                               2
239 Helena Valley West Central                           1
240 Hernando                                             1
241 Hickory                                              1
242 Highland                                             1
243 Highland Heights                                     1
244 Hillsboro                                            1
245 Hobbs                                                1
246 Holland                                              1
247 Hollister                                            1
248 Hollywood                                            1
249 Homestead                                            1
250 Hoover                                               1
251 Hopewell (Township of)                               1
252 Hot Springs (Hot Springs National Park)              1
253 Houston                                              2
254 Huntington                                           1
255 Huntsville                                           1
256 Indianapolis                                         1
257 Indianapolis (Remainder)                             4
258 Indio                                                1
259 Iowa City                                            1
260 Ironton                                              1
261 Jackson                                              2
262 Jacksonville                                         2
263 Jamesville                                           1
264 Jasper                                               3
265 Jean                                                 1
266 Jeffersonville                                       1
267 Jenison                                              1
268 Jersey City                                          1
269 Jerseyville                                          1
270 Johnstown                                            1
271 Joliet                                               1
272 Kalamazoo                                            1
273 Kalispell                                            1
274 Kansas City                                          3
275 Keeler                                               1
276 Keene                                                1
277 Kenansville                                          1
278 Kenner                                               1
279 Kennesaw                                             1
280 Keokuk                                               1
281 Kinston                                              1
282 Kokomo                                               1
283 La Crosse                                            1
284 La Grande                                            1
285 Lackawanna                                           1
286 Laconia                                              1
287 Ladue                                                1
288 Lafayette                                            3
289 Lake Charles                                         1
290 Lakeland                                             1
291 Lakeport                                             1
292 Lakeview                                             1
293 Lancaster                                            2
294 Lander                                               1
295 Lansing                                              1
296 Las Cruces                                           1
297 Las Vegas                                            1
298 Laurel                                               1
299 Lawrence                                             1
300 Leander                                              1
301 Lebanon                                              2
302 Leeds                                                1
303 Lexington                                            1
304 Lexington-Fayette (corporate name for Lexington)     2
305 Libby                                                1
306 Liberty                                              1
307 Lincoln                                              1
308 Lindon                                               1
309 Little Rock                                          2
310 Littleton                                            1
311 Live Oak                                             1
312 Livermore                                            1
313 Livonia                                              1
314 Logan                                                1
315 Long Beach                                           2
316 Longmont                                             1
317 Los Angeles                                          1
318 Louisville                                           4
319 Luna Pier                                            1
320 Lynchburg                                            1
321 Lynn                                                 1
322 Lynwood                                              1
323 Macon                                                2
324 Madison                                              1
325 Magna                                                1
326 Mamaroneck                                           1
327 Manistee                                             1
328 Marble City Community                                1
329 Maricopa                                             1
330 Marion                                               2
331 Marrero                                              1
332 Martinsburg                                          1
333 Marysville                                           1
334 McAlester                                            1
335 McCook                                               1
336 McDonald                                             1
337 McLean                                               1
338 Medford                                              1
339 Melbourne                                            1
340 Mena                                                 1
341 Merced                                               1
342 Meridian                                             1
343 Mesa                                                 1
344 Miami                                                2
345 Michigan City                                        1
346 Middlesborough (corporate name for Middlesboro)      1
347 Middletown                                           2
348 Midlothian                                           1
349 Milwaukee                                            5
350 Mingo Junction                                       1
351 Minneapolis                                          2
352 Mira Loma                                            1
353 Mission                                              1
354 Mission Viejo                                        1
355 Missoula                                             1
356 Modesto                                              1
357 Mojave                                               1
358 Monroe                                               1
359 Montgomery                                           1
360 Morgantown                                           1
361 Morristown                                           1
362 Moundsville                                          1
363 Muncie                                               1
364 Muscatine                                            1
365 Muscle Shoals                                        1
366 Muskegon                                             1
367 Muskogee                                             1
368 Naperville                                           1
369 Nashua                                               1
370 Natchez                                              1
371 New Albany                                           1
372 New Haven                                            5
373 New Paris                                            1
374 New York                                             9
375 Newark                                               2
376 Newburgh                                             1
377 Newburgh Heights                                     1
378 Newport                                              1
379 Niagara Falls                                        1
380 Nogales                                              1
381 Norfolk                                              1
382 Normal                                               1
383 Norristown                                           1
384 North Braddock                                       1
385 North Brunswick Township                             1
386 North Charleston                                     1
387 North Las Vegas                                      1
388 North Little Rock                                    1
389 Northbrook                                           1
390 Norwalk                                              1
391 Norwich                                              1
392 Norwood                                              1
393 Not in a city                                      103
394 Oak Park                                             1
395 Oakland                                              2
396 Oakridge                                             1
397 Odessa                                               1
398 Ogden                                                1
399 Ogden Dunes (Wickliffe)                              1
400 Oglesby                                              1
401 Oklahoma City                                        2
402 Omaha                                                2
403 Onamia                                               1
404 Ontario                                              1
405 Orlando                                              1
406 Overland Park                                        1
407 Paducah                                              1
408 Painesville                                          1
409 Palm Springs                                         1
410 Palm Springs North                                   1
411 Panama City                                          1
412 Pasadena                                             1
413 Pascagoula                                           1
414 Paterson                                             1
415 Pawtucket                                            1
416 Peach Springs                                        1
417 Pelham                                               1
418 Pendleton                                            1
419 Pennsauken (Pensauken)                               1
420 Peoria                                               1
421 Phenix City                                          1
422 Philadelphia                                         5
423 Phillipsburg                                         1
424 Phoenix                                              3
425 Pico Rivera                                          1
426 Pikeville                                            1
427 Pinedale                                             1
428 Pinehurst (Pine Creek)                               1
429 Pinson                                               1
430 Piru                                                 1
431 Pittsboro                                            1
432 Pittsburgh                                           1
433 Pittsfield                                           1
434 Platteville                                          1
435 Pleasant Prairie                                     1
436 Pompano Beach                                        1
437 Port Arthur                                          1
438 Port Huron                                           1
439 Portland                                             2
440 Portola                                              1
441 Portsmouth                                           2
442 Potosi                                               1
443 Potsdam                                              1
444 Powder Springs                                       1
445 Prescott Valley                                      1
446 Presque Isle                                         1
447 Providence                                           2
448 Provo                                                1
449 Pryor (corporate name Pryor Creek)                   1
450 Pueblo                                               1
451 Quincy                                               2
452 Rahway                                               1
453 Raleigh                                              1
454 Rapid City                                           2
455 Ravenna                                              1
456 Redding                                              1
457 Redwood City                                         1
458 Reno                                                 1
459 Reseda                                               1
460 Richfield                                            1
461 Richmond                                             1
462 Ridge Wood Heights                                   1
463 Ridgecrest                                           1
464 Rio Rancho Estates                                   1
465 Riverside                                            1
466 Roanoke                                              2
467 Rochester                                            2
468 Rock Island Arsenal (U.S. Army)                      1
469 Rock Springs                                         1
470 Rockford                                             1
471 Rockwell                                             1
472 Rocky Mount                                          1
473 Rome                                                 1
474 Roseville                                            1
475 Roswell                                              1
476 Roxborough Park                                      1
477 Royal Palm Beach                                     1
478 Rubidoux                                             1
479 Russellville                                         1
480 Rutland                                              1
481 Sacramento                                           2
482 Saint Petersburg                                     1
483 Salinas                                              1
484 Salt Lake City                                       2
485 San Andreas                                          1
486 San Antonio                                          2
487 San Bernardino                                       1
488 San Diego                                            2
489 San Francisco                                        1
490 San Jose                                             1
491 San Luis Obispo                                      1
492 Sandersville                                         1
493 Sanford                                              1
494 Santa Barbara                                        1
495 Santa Fe                                             1
496 Santa Maria                                          1
497 Santa Rosa                                           1
498 Sault Ste. Marie                                     2
499 Savannah                                             1
500 Schiller Park                                        1
501 Scottsbluff                                          1
502 Scottsdale                                           1
503 Scranton                                             1
504 Seaford                                              1
505 Searcy                                               1
506 Seattle                                              2
507 Seeley Lake                                          1
508 Seven Oaks                                           1
509 Shakopee                                             1
510 Sharonville                                          1
511 Shasta Lake                                          1
512 Sheffield                                            1
513 Shepherdsville                                       1
514 Sheridan                                             2
515 Shreveport                                           1
516 Silver City                                          1
517 Simi Valley                                          1
518 Sioux City                                           1
519 Sioux Falls                                          2
520 Soddy-Daisy                                          1
521 South Bend                                           2
522 South Charleston                                     1
523 South Padre Island                                   1
524 Spanish Fork                                         1
525 Spokane                                              1
526 Springdale                                           1
527 Springfield                                          5
528 Spruce Pine                                          1
529 St. Bernard                                          1
530 St. Cloud                                            1
531 St. Joseph                                           1
532 St. Louis                                            4
533 St. Louis Park                                       1
534 St. Paul                                             3
535 State College                                        1
536 Ste. Genevieve                                       1
537 Steubenville                                         1
538 Stockton                                             1
539 Stuttgart                                            1
540 Summit                                               1
541 Suncook                                              1
542 Swansea                                              1
543 Tacoma                                               1
544 Tallahassee                                          1
545 Tampa                                                1
546 Taylors                                              1
547 Tecumseh                                             1
548 Terre Haute                                          2
549 Texarkana                                            1
550 Theodore                                             1
551 Thomaston                                            1
552 Thompson Falls                                       1
553 Thousand Oaks                                        1
554 Toledo                                               3
555 Toms River                                           1
556 Tooele                                               1
557 Topeka                                               1
558 Trenton                                              1
559 Truckee                                              1
560 Tucson                                               2
561 Tulsa                                                2
562 Tupelo                                               1
563 Tuscaloosa                                           1
564 Ukiah                                                1
565 Underhill (Town of)                                  1
566 Union City                                           1
567 Valdosta                                             1
568 Vallejo                                              1
569 Valrico                                              1
570 Vancouver                                            1
571 Victorville                                          1
572 Vienna                                               1
573 Vinton                                               1
574 Virginia                                             1
575 Virginia Beach                                       1
576 Visalia                                              1
577 Warner Robins                                        1
578 Warren                                               1
579 Washington                                           4
580 Waterbury                                            1
581 Waterloo                                             1
582 Watertown                                            1
583 Waukesha                                             1
584 Waynesville                                          1
585 Weirton                                              2
586 West Orange                                          1
587 West Yellowstone                                     1
588 Westfield                                            1
589 Westport                                             1
590 Wheeling                                             1
591 Whitefish                                            1
592 Wichita                                              3
593 Wilmington                                           1
594 Winston-Salem                                        2
595 Winter Park                                          1
596 Wood River                                           1
597 Woodland                                             1
598 Worcester                                            2
599 Wyandotte                                            1
600 Yakima                                               1
601 Yellow Springs                                       1
602 York                                                 1
603 Youngstown                                           2
604 Ypsilanti                                            1
605 Yuba City                                            1
606 Yuma                                                 1
607 Zion                                                 1

Note: Recall that you must specify the retain = TRUE argument of the prep() function to use juice().
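As a reminder of how these two functions fit together, here is a minimal sketch on a small made-up data frame. The `toy_train` data and the single `step_log()` step are illustrative assumptions, not the recipe used in this case study.

```r
library(recipes)

# Hypothetical toy training data (not the air pollution data)
toy_train <- data.frame(y = c(1, 2, 3), x = c(1, 10, 100))

# A one-step recipe: log-transform the predictor x
toy_rec <- recipe(y ~ x, data = toy_train) %>%
  step_log(x)

# retain = TRUE keeps the preprocessed training set inside the trained recipe
toy_prepped <- prep(toy_rec, training = toy_train, retain = TRUE)

# juice() returns that retained, preprocessed training data
toy_juiced <- juice(toy_prepped)
toy_juiced
```

Here `toy_juiced$x` holds the log-transformed training values, because juice() simply hands back what prep() stored.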

Extracting the preprocessed testing data

According to the tidymodels documentation:

bake() takes a trained recipe and applies the operations to a data set to create a design matrix. For example: it applies the centering to new data sets using these means used to create the recipe.
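To make the centering example in that description concrete, here is a minimal sketch. The `toy_train` and `toy_test` data frames are made up for illustration. The point is that bake() centers the new data with the mean learned from the training data, not with the mean of the new data itself.

```r
library(recipes)

# Hypothetical toy data: the training mean of x is 20
toy_train <- data.frame(y = c(1, 2, 3), x = c(10, 20, 30))
toy_test  <- data.frame(y = c(4, 5),    x = c(20, 50))

# Train a recipe that centers x using the training data
toy_prepped <- recipe(y ~ x, data = toy_train) %>%
  step_center(x) %>%
  prep(training = toy_train)

# Apply the trained recipe to new data:
# the test values 20 and 50 become 0 and 30 (training mean of 20 subtracted)
toy_baked <- bake(toy_prepped, new_data = toy_test)
toy_baked$x
```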

If you want to look at the preprocessed testing data, use the bake() function of the recipes package. (You generally want to leave your testing data alone, but it is good to check for issues such as the introduction of NA values.)

Rows: 292
Columns: 36
$ id                          <fct> 1049.1003, 1073.101, 1073.2006, 1089.0014…
$ value                       <dbl> 11.659091, 13.114545, 12.228125, 12.23294…
$ fips                        <fct> 1049, 1073, 1073, 1089, 1103, 1121, 4013,…
$ lat                         <dbl> 34.28763, 33.54528, 33.38639, 34.68767, 3…
$ lon                         <dbl> -85.96830, -86.54917, -86.81667, -86.5863…
$ CMAQ                        <dbl> 8.534772, 9.303766, 10.235612, 9.343611, …
$ zcta_area                   <dbl> 203836235, 148994881, 56063756, 46963946,…
$ zcta_pop                    <dbl> 8300, 14212, 32390, 21297, 30545, 7713, 5…
$ imp_a500                    <dbl> 5.78200692, 0.06055363, 42.42820069, 23.2…
$ imp_a15000                  <dbl> 0.9730444, 2.9956557, 12.7487614, 10.3555…
$ county_area                 <dbl> 2012662359, 2878192209, 2878192209, 20761…
$ county_pop                  <dbl> 71109, 658466, 658466, 334811, 119490, 82…
$ log_dist_to_prisec          <dbl> 3.721489, 7.301545, 4.721755, 4.659519, 6…
$ log_pri_length_5000         <dbl> 8.517193, 9.683336, 10.737240, 8.517193, …
$ log_pri_length_25000        <dbl> 11.90959, 12.53777, 12.99669, 11.47391, 1…
$ log_prisec_length_500       <dbl> 7.310155, 6.214608, 7.528913, 8.760549, 6…
$ log_prisec_length_1000      <dbl> 8.585843, 7.600902, 9.342290, 9.543183, 8…
$ log_prisec_length_5000      <dbl> 10.214200, 11.262645, 11.713190, 11.48606…
$ log_prisec_length_10000     <dbl> 11.50894, 12.14101, 12.53899, 12.68440, 1…
$ log_nei_2008_pm10_sum_15000 <dbl> 3.3500444, 6.6241114, 5.8268686, 3.861625…
$ log_nei_2008_pm10_sum_25000 <dbl> 5.1719202, 7.5490587, 8.8205542, 5.219092…
$ popdens_county              <dbl> 35.330814, 228.777633, 228.777633, 161.26…
$ popdens_zcta                <dbl> 40.718962, 95.385827, 577.735106, 453.475…
$ nohs                        <dbl> 14.3, 7.2, 0.8, 1.2, 4.8, 16.7, 19.1, 6.4…
$ somehs                      <dbl> 16.7, 12.2, 2.6, 3.1, 7.8, 33.3, 15.6, 9.…
$ hs                          <dbl> 35.0, 32.2, 12.9, 15.1, 28.7, 37.5, 26.5,…
$ somecollege                 <dbl> 14.9, 19.0, 17.9, 20.5, 25.0, 12.5, 18.0,…
$ associate                   <dbl> 5.5, 6.8, 5.2, 6.5, 7.5, 0.0, 6.0, 8.8, 3…
$ bachelor                    <dbl> 7.9, 14.8, 35.5, 30.4, 18.2, 0.0, 10.6, 1…
$ grad                        <dbl> 5.8, 7.7, 25.2, 23.3, 8.0, 0.0, 4.1, 5.7,…
$ pov                         <dbl> 13.8, 10.5, 2.1, 5.2, 8.3, 18.8, 21.4, 14…
$ hs_orless                   <dbl> 66.0, 51.6, 16.3, 19.4, 41.3, 87.5, 61.2,…
$ urc2013                     <dbl> 6, 1, 1, 3, 4, 5, 1, 2, 5, 4, 4, 6, 6, 1,…
$ aod                         <dbl> 33.08333, 42.45455, 44.25000, 42.41667, 4…
$ state_California            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,…
$ city_Not.in.a.city          <dbl> NA, NA, NA, NA, NA, NA, 0, NA, NA, NA, NA…

Notice that our city_Not.in.a.city variable seems to contain mostly NA values. Why might that be?

Ah! Perhaps it is because some of our levels were not previously seen in the training set!
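This behavior can be reproduced on a tiny made-up example. The sketch below assumes two hypothetical data frames, `toy_train` and `toy_test`, where the test set contains a city level ("C") that the recipe never saw during training. The recipes package is expected to warn about the new level when baking.

```r
library(recipes)

# Hypothetical toy data: city "C" appears only in the test set
toy_train <- data.frame(value = c(1, 2, 3, 4),
                        city  = factor(c("A", "B", "A", "B")))
toy_test  <- data.frame(value = c(5, 6),
                        city  = factor(c("A", "C")))

# Train a recipe that turns city into dummy variables
toy_prepped <- recipe(value ~ city, data = toy_train) %>%
  step_dummy(city) %>%
  prep(training = toy_train)

# Baking the test data: the row with the unseen level "C" gets NA
# in the dummy column, because "C" was not a level during training
toy_baked <- bake(toy_prepped, new_data = toy_test)
toy_baked
```

The row for city "A" gets a regular 0 in `city_B`, while the row for the unseen city "C" gets NA.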

Let’s check this using the set operations of the dplyr package, which let us find the cities that appear in only one of the test and training sets.

[1] 376   1
[1] 51  1

Indeed, there are many cities in our test data that are not in our training data!
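A comparison of this kind could be sketched as follows. The two small tibbles are made up for illustration, and the object names (`only_in_train`, `only_in_test`) are hypothetical. The idea is to reduce each set to its distinct cities and then use setdiff() in both directions.

```r
library(dplyr)

# Hypothetical toy data standing in for the training and testing sets
toy_train <- tibble(city = c("Dale", "Dallas", "Denver", "Dallas"))
toy_test  <- tibble(city = c("Dallas", "Durham", "Dover"))

# Distinct cities in each set
train_cities <- distinct(toy_train, city)
test_cities  <- distinct(toy_test, city)

# Cities that appear in one set but not the other
only_in_train <- setdiff(train_cities, test_cities)
only_in_test  <- setdiff(test_cities, train_cities)

dim(only_in_train)
dim(only_in_test)
```

Any city listed in `only_in_test` is a level that a recipe trained on the training set has never seen.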

One way to handle this would be to update our original recipe to include a step function called step_novel(), which helps in cases like this where there are new factor levels in the testing set that were not in the training set. It is a good idea to include this step in most recipes that have categorical variables with many distinct values, and it needs to come before the step that creates dummy variables. However, because we are also creating dummy variables from city, this alone still leaves us with a problem.
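For reference, here is a minimal sketch of how step_novel() behaves on made-up data. The `toy_train` and `toy_test` data frames are hypothetical, and this is not the recipe used in the case study. By default step_novel() assigns unseen values to a level called "new".

```r
library(recipes)

# Hypothetical toy data: city "C" appears only in the test set
toy_train <- data.frame(value = c(1, 2, 3, 4),
                        city  = factor(c("A", "B", "A", "B")))
toy_test  <- data.frame(value = c(5, 6),
                        city  = factor(c("A", "C")))

toy_prepped <- recipe(value ~ city, data = toy_train) %>%
  step_novel(city) %>%   # must come before step_dummy()
  step_dummy(city) %>%
  prep(training = toy_train)

# The unseen city "C" is mapped to the "new" level instead of becoming NA
toy_baked <- bake(toy_prepped, new_data = toy_test)
toy_baked
```

In this toy example the NA values disappear, but the `city_new` dummy column is all zeros in the training data, so it carries no information for a model to learn from.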

Let’s modify the city variable so that it only records whether a monitor is in a city or not in a city, using the if_else() function of dplyr. Alternatively, you could create a custom step function to do this and add it to your recipe, but that is beyond the scope of this case study.
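A recoding like this could be sketched as follows. The `toy_pm` tibble is made up for illustration, and the label "In a city" is an assumed name for the new category; "Not in a city" is the value already present in the data.

```r
library(dplyr)

# Hypothetical toy data with a city column like the one shown above
toy_pm <- tibble(city = c("Dale", "Not in a city", "Dallas"))

# Collapse every named city into a single "In a city" category
toy_pm <- toy_pm %>%
  mutate(city = if_else(city == "Not in a city",
                        "Not in a city",
                        "In a city"))

toy_pm
```

With only two possible values, every level in the test set is guaranteed to have been seen in any reasonably sized training set.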

We need to create a new recipe to move forward, because the levels of our variables are established when the recipe is created. We could run into the same issue with state and county, so let’s do a similar thing for state. The county variable appears to get dropped due to either correlation or near-zero variance. Near-zero variance is the more likely cause, because county is the most granular of these geographic categorical variables and is therefore likely sparse.
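To illustrate why a sparse categorical variable tends to be dropped, here is a minimal sketch on made-up data. The `toy` data frame and its `county` levels are hypothetical, and the sketch relies on the default thresholds of step_nzv(). A dummy column that is almost always 0 is flagged as having near-zero variance and removed.

```r
library(recipes)

# Hypothetical toy data: county "c" occurs only once in 100 rows
toy <- data.frame(
  value  = seq_len(100) / 10,
  county = factor(c(rep("a", 50), rep("b", 49), "c"))
)

toy_prepped <- recipe(value ~ county, data = toy) %>%
  step_dummy(county) %>%          # creates county_b and county_c
  step_nzv(all_predictors()) %>%  # drops near-zero variance predictors
  prep(training = toy, retain = TRUE)

# county_b is roughly balanced and is kept;
# county_c is 1 in a single row and is removed
kept <- names(juice(toy_prepped))
kept
```

With many counties and only a few monitors in each, most county dummy columns would look like `county_c` here.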

<Analysis/Assess/Total>
<584/292/876>

Now let’s retrain our training data and try baking our test data:

oper 1 step dummy [training] 
oper 2 step corr [training] 
oper 3 step nzv [training] 
The retained training set is ~ 0.26 Mb  in memory.

Rows: 584
Columns: 37
$ id                          <fct> 1003.001, 1027.0001, 1033.1002, 1055.001,…
$ value                       <dbl> 9.597647, 10.800000, 11.212174, 12.375394…
$ fips                        <fct> 1003, 1027, 1033, 1055, 1069, 1073, 1073,…
$ lat                         <dbl> 30.49800, 33.28126, 34.75878, 33.99375, 3…
$ lon                         <dbl> -87.88141, -85.80218, -87.65056, -85.9910…
$ CMAQ                        <dbl> 8.098836, 9.766208, 9.402679, 9.241744, 9…
$ zcta_area                   <dbl> 190980522, 374132430, 16716984, 154069359…
$ zcta_pop                    <dbl> 27829, 5103, 9042, 20045, 30217, 9010, 16…
$ imp_a500                    <dbl> 0.01730104, 1.96972318, 19.17301038, 16.4…
$ imp_a15000                  <dbl> 1.4386207, 0.3359198, 5.2472094, 5.161210…
$ county_area                 <dbl> 4117521611, 1564252280, 1534877333, 13856…
$ county_pop                  <dbl> 182265, 13932, 54428, 104430, 101547, 658…
$ log_dist_to_prisec          <dbl> 4.648181, 7.219907, 5.760131, 5.261457, 7…
$ log_pri_length_5000         <dbl> 8.517193, 8.517193, 8.517193, 9.066563, 8…
$ log_pri_length_25000        <dbl> 11.32735, 10.12663, 10.15769, 12.01356, 1…
$ log_prisec_length_500       <dbl> 7.295356, 6.214608, 8.611945, 8.740680, 6…
$ log_prisec_length_1000      <dbl> 8.195119, 7.600902, 9.735569, 9.627898, 7…
$ log_prisec_length_5000      <dbl> 10.815042, 10.170878, 11.770407, 11.72888…
$ log_prisec_length_10000     <dbl> 11.886803, 11.405543, 12.840663, 12.76827…
$ log_prisec_length_25000     <dbl> 13.41395, 12.79980, 13.79973, 13.70026, 1…
$ log_nei_2008_pm10_sum_15000 <dbl> 2.26783411, 3.31111648, 6.70127741, 4.462…
$ log_nei_2008_pm10_sum_25000 <dbl> 5.628728, 3.311116, 7.148858, 4.678311, 3…
$ popdens_county              <dbl> 44.265706, 8.906492, 35.460814, 75.367038…
$ popdens_zcta                <dbl> 145.7164307, 13.6395554, 540.8870404, 130…
$ nohs                        <dbl> 3.3, 11.6, 7.3, 4.3, 5.8, 7.1, 2.7, 11.1,…
$ somehs                      <dbl> 4.9, 19.1, 15.8, 13.3, 11.6, 17.1, 6.6, 1…
$ hs                          <dbl> 25.1, 33.9, 30.6, 27.8, 29.8, 37.2, 30.7,…
$ somecollege                 <dbl> 19.7, 18.8, 20.9, 29.2, 21.4, 23.5, 25.7,…
$ associate                   <dbl> 8.2, 8.0, 7.6, 10.1, 7.9, 7.3, 8.0, 4.1, …
$ bachelor                    <dbl> 25.3, 5.5, 12.7, 10.0, 13.7, 5.9, 17.6, 7…
$ grad                        <dbl> 13.5, 3.1, 5.1, 5.4, 9.8, 2.0, 8.7, 2.9, …
$ pov                         <dbl> 6.1, 19.5, 19.0, 8.8, 15.6, 25.5, 7.3, 8.…
$ hs_orless                   <dbl> 33.3, 64.6, 53.7, 45.4, 47.2, 61.4, 40.0,…
$ urc2013                     <dbl> 4, 6, 4, 4, 4, 1, 1, 1, 1, 1, 2, 3, 3, 3,…
$ aod                         <dbl> 37.363636, 34.818182, 36.000000, 43.41666…
$ state_Not.California        <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ city_Not.in.a.city          <dbl> 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0,…

Notice, it looks like we gained the log_prisec_length_25000 back with this recipe using the data with our changes to state and city.

Rows: 292
Columns: 37
$ id                          <fct> 1049.1003, 1073.101, 1073.2006, 1089.0014…
$ value                       <dbl> 11.659091, 13.114545, 12.228125, 12.23294…
$ fips                        <fct> 1049, 1073, 1073, 1089, 1103, 1121, 4013,…
$ lat                         <dbl> 34.28763, 33.54528, 33.38639, 34.68767, 3…
$ lon                         <dbl> -85.96830, -86.54917, -86.81667, -86.5863…
$ CMAQ                        <dbl> 8.534772, 9.303766, 10.235612, 9.343611, …
$ zcta_area                   <dbl> 203836235, 148994881, 56063756, 46963946,…
$ zcta_pop                    <dbl> 8300, 14212, 32390, 21297, 30545, 7713, 5…
$ imp_a500                    <dbl> 5.78200692, 0.06055363, 42.42820069, 23.2…
$ imp_a15000                  <dbl> 0.9730444, 2.9956557, 12.7487614, 10.3555…
$ county_area                 <dbl> 2012662359, 2878192209, 2878192209, 20761…
$ county_pop                  <dbl> 71109, 658466, 658466, 334811, 119490, 82…
$ log_dist_to_prisec          <dbl> 3.721489, 7.301545, 4.721755, 4.659519, 6…
$ log_pri_length_5000         <dbl> 8.517193, 9.683336, 10.737240, 8.517193, …
$ log_pri_length_25000        <dbl> 11.90959, 12.53777, 12.99669, 11.47391, 1…
$ log_prisec_length_500       <dbl> 7.310155, 6.214608, 7.528913, 8.760549, 6…
$ log_prisec_length_1000      <dbl> 8.585843, 7.600902, 9.342290, 9.543183, 8…
$ log_prisec_length_5000      <dbl> 10.214200, 11.262645, 11.713190, 11.48606…
$ log_prisec_length_10000     <dbl> 11.50894, 12.14101, 12.53899, 12.68440, 1…
$ log_prisec_length_25000     <dbl> 13.55979, 14.08915, 14.27363, 13.87170, 1…
$ log_nei_2008_pm10_sum_15000 <dbl> 3.3500444, 6.6241114, 5.8268686, 3.861625…
$ log_nei_2008_pm10_sum_25000 <dbl> 5.1719202, 7.5490587, 8.8205542, 5.219092…
$ popdens_county              <dbl> 35.330814, 228.777633, 228.777633, 161.26…
$ popdens_zcta                <dbl> 40.718962, 95.385827, 577.735106, 453.475…
$ nohs                        <dbl> 14.3, 7.2, 0.8, 1.2, 4.8, 16.7, 19.1, 6.4…
$ somehs                      <dbl> 16.7, 12.2, 2.6, 3.1, 7.8, 33.3, 15.6, 9.…
$ hs                          <dbl> 35.0, 32.2, 12.9, 15.1, 28.7, 37.5, 26.5,…
$ somecollege                 <dbl> 14.9, 19.0, 17.9, 20.5, 25.0, 12.5, 18.0,…
$ associate                   <dbl> 5.5, 6.8, 5.2, 6.5, 7.5, 0.0, 6.0, 8.8, 3…
$ bachelor                    <dbl> 7.9, 14.8, 35.5, 30.4, 18.2, 0.0, 10.6, 1…
$ grad                        <dbl> 5.8, 7.7, 25.2, 23.3, 8.0, 0.0, 4.1, 5.7,…
$ pov                         <dbl> 13.8, 10.5, 2.1, 5.2, 8.3, 18.8, 21.4, 14…
$ hs_orless                   <dbl> 66.0, 51.6, 16.3, 19.4, 41.3, 87.5, 61.2,…
$ urc2013                     <dbl> 6, 1, 1, 3, 4, 5, 1, 2, 5, 4, 4, 6, 6, 1,…
$ aod                         <dbl> 33.08333, 42.45455, 44.25000, 42.41667, 4…
$ state_Not.California        <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0,…
$ city_Not.in.a.city          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…

Great, now we no longer have NA values! :)

Note: if you use the skip option for some of the preprocessing steps, be careful. juice() will show all of the results ignoring skip = TRUE. bake() will not necessarily conduct these steps on the new data.
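The prep-then-bake pattern used above can be sketched on a tiny example. The train and test data frames here are invented, and the recipe has only a single step_dummy() step, so this is a simplified illustration rather than the recipe used in this case study.

```r
library(recipes)

# Invented toy data: a numeric outcome, one numeric and one categorical predictor
train <- data.frame(
  y    = c(1, 2, 3, 4, 5, 6),
  x    = c(2, 4, 5, 4, 5, 7),
  city = factor(c("In a city", "Not in a city", "In a city",
                  "Not in a city", "In a city", "Not in a city"))
)
test <- data.frame(
  y    = c(7, 8),
  x    = c(8, 9),
  city = factor(c("Not in a city", "In a city"),
                levels = c("In a city", "Not in a city"))
)

# Specify the preprocessing: turn the categorical predictor into a dummy variable
rec <- recipe(y ~ ., data = train) %>%
  step_dummy(all_nominal())

# prep() estimates what the steps need from the training data only
prepped <- prep(rec, training = train)

# juice() returns the preprocessed training data,
# bake() applies the same trained steps to new data
baked_train <- juice(prepped)
baked_test  <- bake(prepped, new_data = test)

baked_test
```

Because both levels of city appear in the training data, the baked testing data contain no NA values.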

Specifying the Model

So far we have used rsample to split the data and recipes to assign variable roles and to specify and prep our preprocessing (as well as to optionally extract the preprocessed data).

We will now use the parsnip package (which is similar to the earlier caret package, hence the vegetable name) to specify our model.

There are four aspects to define about our model:
1) the type of model (using specific functions in parsnip like rand_forest(), logistic_reg() etc.)
2) the mode of learning - classification or regression (using the set_mode() function)
3) the package or engine that we will use to implement the type of model selected (using the set_engine() function)
4) any arguments necessary for the model/package selected (using the set_args() function; for example, the mtry = argument for random forest, which is the number of variables considered as candidates for splitting at each tree node)

We are going to start our analysis with a linear regression but we will demonstrate how we can try different models.

The first thing we do is define what type of model we would like to use. See here for the modeling options in parsnip.

Linear Regression Model Specification (regression)

OK. So far all we have told parsnip is we want to use a linear regression… Let’s tell parsnip more about what we want.

We would like to use the ordinary least squares method to fit our linear regression. So we will tell parsnip that we want to use the lm engine (the lm() function of the stats package) to implement our linear regression (there are actually many other options, such as rstan, glmnet, keras, and sparklyr). We will do so by using the set_engine() function of the parsnip package.

Linear Regression Model Specification (regression)

Computational engine: lm 

Some packages can do either classification or regression, so it is a good idea to specify which mode you intend to perform. You can do this with the set_mode() function of the parsnip package, by using either set_mode("classification") or set_mode("regression").

Linear Regression Model Specification (regression)

Computational engine: lm 
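Putting these pieces together, a specification like the one printed above could be built as follows. This is a sketch of the pattern, and the object name lm_spec is ours.

```r
library(parsnip)

lm_spec <- linear_reg() %>%   # the type of model
  set_engine("lm") %>%        # the engine that implements it
  set_mode("regression")      # the mode of learning

lm_spec
```

Nothing has been fit at this point. The object only records what we intend to do once data are supplied.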

Fitting the Model: two ways - workflows and parsnip

To fit our model we can use the parsnip package and then assess our fit using the yardstick package.

However, a newer package called workflows allows us to keep track of both our preprocessing steps and our model specification. It also allows us to implement fancier optimizations in an automated way, and it is currently being developed to also handle post-processing operations, so it is good to learn about it!

So we will now create a workflow with the recipe (our preprocessing specifications) that we made and the model that we just specified.

First we use the workflow() function of the workflows package to create a workflow.

Then we add our recipe with the add_recipe() function and we add our model with the add_model() function of the workflows package.

Note: We do not need to actually prep our recipe before using workflows!

══ Workflow ══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()

── Preprocessor ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
3 Recipe Steps

● step_dummy()
● step_corr()
● step_nzv()

── Model ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Linear Regression Model Specification (regression)

Computational engine: lm 

Ah, nice. Notice how it tells us about both our preprocessing steps and our model specifications.

Now we can prepare the recipe (estimate the parameters) and fit the model to our training data all at once. Printing the output we can see the coefficients of the model.

══ Workflow [trained] ════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()

── Preprocessor ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
3 Recipe Steps

● step_dummy()
● step_corr()
● step_nzv()

── Model ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

Call:
stats::lm(formula = formula, data = data)

Coefficients:
                (Intercept)                          lat  
                  2.265e+02                    3.639e-02  
                        lon                         CMAQ  
                  2.579e-02                    2.847e-01  
                  zcta_area                     zcta_pop  
                  3.638e-10                    7.880e-06  
                   imp_a500                   imp_a15000  
                  7.453e-03                   -1.140e-03  
                county_area                   county_pop  
                 -2.116e-11                   -2.156e-07  
         log_dist_to_prisec          log_pri_length_5000  
                  4.064e-02                   -1.842e-01  
       log_pri_length_25000        log_prisec_length_500  
                  7.936e-03                    2.572e-01  
     log_prisec_length_1000       log_prisec_length_5000  
                 -2.288e-02                    4.743e-01  
    log_prisec_length_10000      log_prisec_length_25000  
                 -1.410e-01                    4.658e-01  
log_nei_2008_pm10_sum_15000  log_nei_2008_pm10_sum_25000  
                  1.088e-01                    5.255e-02  
             popdens_county                 popdens_zcta  
                 -4.114e-05                   -1.816e-05  
                       nohs                       somehs  
                 -2.234e+00                   -2.270e+00  
                         hs                  somecollege  
                 -2.271e+00                   -2.274e+00  
                  associate                     bachelor  
                 -2.272e+00                   -2.277e+00  
                       grad                          pov  
                 -2.284e+00                    7.001e-03  
                  hs_orless                      urc2013  
                         NA                    2.153e-01  
                        aod         state_Not.California  
                  2.555e-02                   -3.445e+00  
         city_Not.in.a.city  
                  3.550e-01  
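Here is a minimal sketch of creating and fitting a workflow. The built-in mtcars data stand in for our training data, and the recipe steps and the 0.9 correlation threshold are assumptions chosen for illustration.

```r
library(recipes)
library(parsnip)
library(workflows)

# Unprepped recipe: the workflow will prep it for us when we call fit()
rec <- recipe(mpg ~ ., data = mtcars) %>%
  step_corr(all_predictors(), threshold = 0.9) %>%
  step_nzv(all_predictors())

lm_spec <- linear_reg() %>%
  set_engine("lm")

# Bundle the preprocessing and the model specification together
wf <- workflow() %>%
  add_recipe(rec) %>%
  add_model(lm_spec)

# Estimate the recipe parameters and fit the model in one call,
# using the raw (not preprocessed) data
wf_fit <- fit(wf, data = mtcars)

wf_fit
```

Printing wf_fit shows the recipe steps followed by the coefficients of the fitted linear model, in the same layout as the output above.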

Alternatively, we could have done this without the workflows package. Notice that here we use the preprocessed training data (juiced_train), as opposed to the raw training data that we used with the workflow.

In this case, we actually need to write our model formula again! Recall that id and fips are ID variables and that value is our outcome of interest (the PM air pollution measure at each monitor).
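A sketch of the parsnip-only route might look like this. Again mtcars stands in for our data, and the two-predictor formula is an assumption chosen for illustration.

```r
library(recipes)
library(parsnip)

# Without workflows we must prep the recipe and extract the data ourselves
rec <- recipe(mpg ~ ., data = mtcars) %>%
  step_nzv(all_predictors())
juiced <- juice(prep(rec, training = mtcars))

# The formula has to be stated again when fitting with parsnip directly
parsnip_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(mpg ~ wt + hp, data = juiced)

parsnip_fit
```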

Looking at model fit with broom

The broom package allows for an easy/tidy way to look at the fitted model:

tidy() grabs the coefficients from the model
glance() summarizes the model fit and gives us an idea about how well the model might perform
augment() gives an observation-level summary of the data and fit (one row per observation)

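These broom functions currently only work with parsnip objects, not raw workflows objects. To use the tidy() function with a workflow, we first need to extract the fitted model with the pull_workflow_fit() function (later versions of workflows provide extract_fit_parsnip() for the same purpose).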
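To see what each of the three broom functions returns, here is a small sketch on a plain lm() fit of the built-in mtcars data (an assumption made to keep the example self-contained).

```r
library(broom)

fit_lm <- lm(mpg ~ wt + hp, data = mtcars)

coefs       <- tidy(fit_lm)     # one row per coefficient
fit_summary <- glance(fit_lm)   # a single row of model-level statistics
per_obs     <- augment(fit_lm)  # one row per observation, with .fitted, .resid, ...

coefs
fit_summary
head(per_obs)
```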

# A tibble: 34 x 5
   term                     estimate    std.error statistic  p.value
   <chr>                       <dbl>        <dbl>     <dbl>    <dbl>
 1 state_Not.California -3.44        0.436            -7.91 1.45e-14
 2 CMAQ                  0.285       0.0430            6.61 8.90e-11
 3 aod                   0.0256      0.00575           4.45 1.06e- 5
 4 lon                   0.0258      0.00998           2.58 1.00e- 2
 5 county_pop           -0.000000216 0.0000000934     -2.31 2.14e- 2
 6 urc2013               0.215       0.101             2.13 3.35e- 2
 7 grad                 -2.28        1.20             -1.90 5.78e- 2
 8 bachelor             -2.28        1.20             -1.90 5.84e- 2
 9 somecollege          -2.27        1.20             -1.89 5.87e- 2
10 associate            -2.27        1.20             -1.89 5.90e- 2
# … with 24 more rows
# A tibble: 1 x 11
  r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>    <dbl> <int>  <dbl> <dbl> <dbl>
1     0.433         0.399  2.08      12.7 1.02e-48    34 -1240. 2550. 2703.
# … with 2 more variables: deviance <dbl>, df.residual <int>
# A tibble: 584 x 42
   value   lat   lon  CMAQ zcta_area zcta_pop imp_a500 imp_a15000 county_area
   <dbl> <dbl> <dbl> <dbl>     <dbl>    <dbl>    <dbl>      <dbl>       <dbl>
 1  9.60  30.5 -87.9  8.10 190980522    27829   0.0173      1.44   4117521611
 2 10.8   33.3 -85.8  9.77 374132430     5103   1.97        0.336  1564252280
 3 11.2   34.8 -87.7  9.40  16716984     9042  19.2         5.25   1534877333
 4 12.4   34.0 -86.0  9.24 154069359    20045  16.5         5.16   1385618994
 5 10.5   31.2 -85.4  9.12 162685124    30217  19.1         4.74   1501737720
 6 15.6   33.6 -86.8 10.2   26929603     9010  41.8        17.5    2878192209
 7 12.4   33.3 -87.0 10.2  166239542    16140   1.70        4.30   2878192209
 8 11.1   33.5 -87.3  8.16 385566685     3699   0           0.162  3423328940
 9 14.6   33.5 -86.9 10.2   10636977    11458  43.6        15.6    2878192209
10 12.0   33.7 -86.7  9.30 150661846    21725   1.48        4.25   2878192209
# … with 574 more rows, and 33 more variables: county_pop <dbl>,
#   log_dist_to_prisec <dbl>, log_pri_length_5000 <dbl>,
#   log_pri_length_25000 <dbl>, log_prisec_length_500 <dbl>,
#   log_prisec_length_1000 <dbl>, log_prisec_length_5000 <dbl>,
#   log_prisec_length_10000 <dbl>, log_prisec_length_25000 <dbl>,
#   log_nei_2008_pm10_sum_15000 <dbl>, log_nei_2008_pm10_sum_25000 <dbl>,
#   popdens_county <dbl>, popdens_zcta <dbl>, nohs <dbl>, somehs <dbl>,
#   hs <dbl>, somecollege <dbl>, associate <dbl>, bachelor <dbl>, grad <dbl>,
#   pov <dbl>, hs_orless <dbl>, urc2013 <dbl>, aod <dbl>,
#   state_Not.California <dbl>, city_Not.in.a.city <dbl>, .fitted <dbl>,
#   .se.fit <dbl>, .resid <dbl>, .hat <dbl>, .sigma <dbl>, .cooksd <dbl>,
#   .std.resid <dbl>
[1] TRUE

OK, so we have fit our model on our training data, which means we have created a model to predict values of air pollution based on the predictors that we have included. Yay!

We can get a sense of the variable importance using the vip() function of the vip package.

Let’s take a look at the top 10 contributing variables:

Model Performance

Let’s take a look at how well our model fit our training data:

        1         2         3         4         5         6 
 9.461664 10.429189 11.795351 11.139746 10.863402 11.091857 
# A tibble: 6 x 8
  value .fitted .se.fit .resid   .hat .sigma    .cooksd .std.resid
  <dbl>   <dbl>   <dbl>  <dbl>  <dbl>  <dbl>      <dbl>      <dbl>
1  9.60    9.46   0.375  0.136 0.0324   2.09 0.00000433     0.0663
2 10.8    10.4    0.383  0.371 0.0338   2.09 0.0000338      0.181 
3 11.2    11.8    0.404 -0.583 0.0376   2.09 0.0000936     -0.285 
4 12.4    11.1    0.388  1.24  0.0346   2.08 0.000384       0.604 
5 10.5    10.9    0.426 -0.355 0.0418   2.09 0.0000388     -0.174 
6 15.6    11.1    0.379  4.50  0.0332   2.08 0.00486        2.20  
        1         2         3         4         5         6 
 9.461664 10.429189 11.795351 11.139746 10.863402 11.091857 
# A tibble: 6 x 8
  value .fitted .se.fit .resid   .hat .sigma    .cooksd .std.resid
  <dbl>   <dbl>   <dbl>  <dbl>  <dbl>  <dbl>      <dbl>      <dbl>
1  9.60    9.46   0.375  0.136 0.0324   2.09 0.00000433     0.0663
2 10.8    10.4    0.383  0.371 0.0338   2.09 0.0000338      0.181 
3 11.2    11.8    0.404 -0.583 0.0376   2.09 0.0000936     -0.285 
4 12.4    11.1    0.388  1.24  0.0346   2.08 0.000384       0.604 
5 10.5    10.9    0.426 -0.355 0.0418   2.09 0.0000388     -0.174 
6 15.6    11.1    0.379  4.50  0.0332   2.08 0.00486        2.20  
[1] TRUE

OK, so the range of our fitted values appears to be narrower than the range of the real values. We could probably do a bit better.

Let’s take a look at how well our model seems to be performing more formally:

When assessing the performance of a model, the metrics we use depend on whether we are performing a classification or a regression (prediction of a continuous outcome) analysis. In our case we are performing a regression analysis, and the metrics often used are:
1) mean absolute error (mae)
2) R squared (rsq), also known as the coefficient of determination, which is the squared correlation between truth and estimate
3) root mean squared error (rmse)

We can use the yardstick package to quickly calculate estimates for all of these values using the metrics() function. Alternatively if you only wanted one metric you could use the mae(), rsq(), or rmse() functions respectively. This is helpful to examine with our fitted training set values to see how well our model is performing and if we need to make adjustments.

# A tibble: 3 x 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard       2.02 
2 rsq     standard       0.433
3 mae     standard       1.49 
# A tibble: 1 x 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 mae     standard        1.49
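To make these three metrics concrete, we can compute them by hand in base R. The truth and estimate vectors below are the rounded value and .fitted columns for the first six monitors shown earlier, so the results describe only those six rows, not the whole training set.

```r
# Observed values and fitted values for the first six monitors (rounded)
truth    <- c(9.60, 10.8, 11.2, 12.4, 10.5, 15.6)
estimate <- c(9.46, 10.4, 11.8, 11.1, 10.9, 11.1)

# Mean absolute error: the average size of the residuals
mae_value <- mean(abs(truth - estimate))

# Root mean squared error: penalizes large residuals more heavily
rmse_value <- sqrt(mean((truth - estimate)^2))

# R squared as the squared correlation between truth and estimate
rsq_value <- cor(truth, estimate)^2

c(mae = mae_value, rmse = rmse_value, rsq = rsq_value)
```

The rmse is never smaller than the mae, and the gap between them grows when a few residuals are much larger than the rest, as with the sixth monitor here.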

Cross validation sample splitting

We will use the rsample package again in order to further implement what are called cross validation techniques. This is also called resampling or repartitioning.

Note: we are not actually getting new samples from the underlying distribution so the term resampling is a bit of a misnomer.

Cross validation splits our training data into multiple training data sets to allow for a deeper assessment of the accuracy of the model.

Here is a visualization of the concept for cross validation/resampling/repartitioning from Max Kuhn:

Technically, creating our testing and training sets out of our original data is sometimes considered a form of cross validation called the holdout method. As we just learned, this can give us a better sense of the accuracy of our model in a more generalizable way.

However, we can do a better job of optimizing our model for accuracy if we also perform another type of cross validation on the newly defined training set that we just created. There are many cross validation methods and most can be easily implemented using the rsample package. We will use a very popular method called either k-fold or v-fold cross validation.

This method essentially involves performing the holdout method iteratively with the training data.

First the training set is divided into k or v equally sized smaller pieces.

Then the model is trained on k-1 or v-1 of the pieces and evaluated on the piece that was left out. This is repeated, leaving out a different piece each time, until every piece has been held out once, to get a sense of the performance of the model. This is really useful for fine tuning specific aspects of the model in a process called model tuning.
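The idea can be sketched by hand in base R before we hand the job over to rsample. The built-in mtcars data, the two-predictor model, and the seed are all assumptions made for illustration.

```r
set.seed(1234)

k <- 4
n <- nrow(mtcars)

# Randomly assign each row to one of k equally sized folds
fold_id <- sample(rep(seq_len(k), length.out = n))

# For each fold: train on the other k-1 folds, evaluate on the held-out fold
fold_rmse <- sapply(seq_len(k), function(i) {
  analysis_set   <- mtcars[fold_id != i, ]
  assessment_set <- mtcars[fold_id == i, ]
  fit  <- lm(mpg ~ wt + hp, data = analysis_set)
  pred <- predict(fit, newdata = assessment_set)
  sqrt(mean((assessment_set$mpg - pred)^2))
})

fold_rmse        # one rmse per held-out fold
mean(fold_rmse)  # the cross validation estimate of the rmse
```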

Here is a visualization of how the folds are created:

Note: People typically ignore spatial dependence with cross validation of air pollution monitoring data in the air pollution field, so we will do the same. However, it might make sense to leave out blocks of monitors rather than random individual monitors to help account for some spatial dependence.

The vfold_cv() function of the rsample package can be used to parse the training data into folds for k-fold/v-fold cross validation.

The v argument specifies the number of folds to create. The repeats argument specifies how many times the v-fold partitioning should be repeated (the default is 1, meaning no repeats). The strata argument specifies a variable to stratify samples across folds (just like in initial_split()).

Again, because the folds are created at random, we need to use the base set.seed() function in order to obtain the same results each time we knit this document. Generally speaking, using 10 folds is good practice, but this depends on the variability within your data. We are going to use 4 for the sake of expediency.

#  4-fold cross-validation 
# A tibble: 4 x 2
  splits            id   
  <list>            <chr>
1 <split [438/146]> Fold1
2 <split [438/146]> Fold2
3 <split [438/146]> Fold3
4 <split [438/146]> Fold4

Once the folds are created they can be used to evaluate performance by fitting the model to each of the resamples that we created:

We can fit the model to our cross validation folds using the fit_resamples() function of the tune package, by specifying our workflow object and the cross validation fold object we just created. See here for more information.

We can now take a look at various metrics of performance based on the fit of our cross validation “resamples”. To do this we will use the show_best() function of the tune package.

# A tibble: 1 x 5
  .metric .estimator  mean     n std_err
  <chr>   <chr>      <dbl> <int>   <dbl>
1 rmse    standard    2.18     4  0.0763
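A compact sketch of these steps, from creating the folds to summarizing performance, could look like this. The mtcars data, the formula preprocessor, and the seed are assumptions standing in for our training data and recipe.

```r
library(rsample)
library(parsnip)
library(workflows)
library(tune)

set.seed(1234)

# Create 4 folds from the (stand-in) training data
folds <- vfold_cv(mtcars, v = 4)

# A simple workflow: a formula as the preprocessor plus a linear model
wf <- workflow() %>%
  add_formula(mpg ~ wt + hp) %>%
  add_model(linear_reg() %>% set_engine("lm"))

# Fit the workflow on the analysis part of each fold and
# evaluate it on the corresponding assessment part
resample_fit <- fit_resamples(wf, resamples = folds)

# Average the metrics across the folds
cv_metrics <- collect_metrics(resample_fit)
cv_metrics

show_best(resample_fit, metric = "rmse")
```

The n column of the result reports how many folds contributed to each mean, and std_err gives a sense of how much the metric varies from fold to fold.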

Tuning

Now let’s try some tuning.

Let’s take a closer look at how the air pollution monitor values vary with the location latitude and longitude.

We can see that there does not appear to be a single linear relationship for either of these predictors. Thus we might want to think about using splines (see https://towardsdatascience.com/numerical-interpolation-natural-cubic-spline-52c1157b98ac, https://tidymodels.github.io/tune/articles/getting_started.html, and https://www.psych.mcgill.ca/misc/fda/ex-basis-b1.html) to model the relationship in our training data more closely. For example, for the latitude plot (left), if we had 2 lines and one break-point (called a knot) around 40, with the first line having a positive slope and the second a negative slope, this would fit the data more like the blue line shown in the figure.

We can tune for the number of knots by using a step function in the recipes package called step_ns(), where ns stands for natural splines. In order to tune for the number of knots or degrees of freedom, we can set the deg_free argument to tune(). This is helpful because we aren’t yet sure how closely we should follow the relationship between the value and the longitude and latitude in our training data to achieve good accuracy while keeping our model generalizable to other data.

This is when our cross validation methods become really handy. We can test out different values for the deg_free argument and see how our model performance varies across our training folds to try to find the optimal value.
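To see why a spline can help, here is a small base R sketch using the ns() function of the splines package directly. The data are simulated (an assumption) so that the outcome rises with latitude up to about 40 and then falls, loosely mimicking the shape described above.

```r
library(splines)

set.seed(1234)

# Simulated data: the outcome peaks around a latitude of 40
lat   <- seq(25, 50, length.out = 200)
value <- 12 - 0.02 * (lat - 40)^2 + rnorm(200, sd = 0.5)

# A single straight line versus a natural spline with 3 degrees of freedom
linear_fit <- lm(value ~ lat)
spline_fit <- lm(value ~ ns(lat, df = 3))

summary(linear_fit)$r.squared
summary(spline_fit)$r.squared
```

The spline fit explains more of the variation in these training data. Whether that improvement carries over to new data is exactly what the cross validation folds help us check.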

We will update our recipe to add these steps. It is a good idea to do this for individual predictors because you can give each one its own id inside tune() so that you can keep track of it later. We can see what we intend to tune with the parameters() function of the dials package.

See here for more information about implementing this in tidymodels.

Collection of 2 parameters for tuning

     id parameter type object class
 lon df       deg_free    nparam[+]
 lat df       deg_free    nparam[+]

Generally you could use the grid_*() functions of the dials package to create the different combinations of degrees of freedom to test for both variables to optimize the model. In our case we can visibly see that if we add more than say 4 or 5 degrees of freedom we will likely over-fit the data. So instead of using these functions we will create our own grid using the base seq() and expand.grid() functions.

  lon df lat df
1      1      1
2      3      1
3      5      1
4      1      3
5      3      3
6      5      3
7      1      5
8      3      5
9      5      5
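A grid like the one above can be produced with just these two base functions. The object names df_vals and df_grid are ours.

```r
# Candidate degrees of freedom: 1, 3, 5
df_vals <- seq(from = 1, to = 5, by = 2)

# All combinations of the candidates for the two tuning ids;
# the column names must match the ids given to tune() in the recipe
df_grid <- expand.grid(`lon df` = df_vals, `lat df` = df_vals)

df_grid
```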

Now we will tune this hyper-parameter (degrees of freedom) for both the lat and lon variables using our cross validation folds. To do this we will use the tune_grid() function of the tune package.

#  4-fold cross-validation 
# A tibble: 4 x 4
  splits            id    .metrics          .notes           
  <list>            <chr> <list>            <list>           
1 <split [438/146]> Fold1 <tibble [18 × 5]> <tibble [18 × 1]>
2 <split [438/146]> Fold2 <tibble [18 × 5]> <tibble [18 × 1]>
3 <split [438/146]> Fold3 <tibble [18 × 5]> <tibble [18 × 1]>
4 <split [438/146]> Fold4 <tibble [18 × 5]> <tibble [18 × 1]>
# A tibble: 18 x 7
   `lon df` `lat df` .metric .estimator  mean     n std_err
      <dbl>    <dbl> <chr>   <chr>      <dbl> <int>   <dbl>
 1        1        1 rmse    standard   2.18      4  0.0763
 2        1        1 rsq     standard   0.361     4  0.0486
 3        1        3 rmse    standard   2.13      4  0.0807
 4        1        3 rsq     standard   0.383     4  0.0388
 5        1        5 rmse    standard   2.13      4  0.0812
 6        1        5 rsq     standard   0.386     4  0.0403
 7        3        1 rmse    standard   2.10      4  0.0613
 8        3        1 rsq     standard   0.402     4  0.0296
 9        3        3 rmse    standard   2.05      4  0.0709
10        3        3 rsq     standard   0.428     4  0.0191
11        3        5 rmse    standard   2.03      4  0.0656
12        3        5 rsq     standard   0.439     4  0.0197
13        5        1 rmse    standard   2.11      4  0.0616
14        5        1 rsq     standard   0.397     4  0.0258
15        5        3 rmse    standard   2.05      4  0.0739
16        5        3 rsq     standard   0.427     4  0.0164
17        5        5 rmse    standard   2.02      4  0.0717
18        5        5 rsq     standard   0.442     4  0.0153
# A tibble: 1 x 7
  `lon df` `lat df` .metric .estimator  mean     n std_err
     <dbl>    <dbl> <chr>   <chr>      <dbl> <int>   <dbl>
1        5        5 rmse    standard    2.02     4  0.0717
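The whole tuning loop can be sketched end to end on a small example. Here mtcars stands in for our training data, wt and hp stand in for lon and lat, and the grid is reduced to two candidate values per predictor; all of these are assumptions made to keep the example quick to run.

```r
library(recipes)
library(parsnip)
library(workflows)
library(rsample)
library(tune)

set.seed(1234)
folds <- vfold_cv(mtcars, v = 4)

# Mark the degrees of freedom of each spline for tuning, with its own id
spline_rec <- recipe(mpg ~ wt + hp, data = mtcars) %>%
  step_ns(wt, deg_free = tune("wt df")) %>%
  step_ns(hp, deg_free = tune("hp df"))

wf <- workflow() %>%
  add_recipe(spline_rec) %>%
  add_model(linear_reg() %>% set_engine("lm"))

# Candidate values; the column names match the tune() ids
df_grid <- expand.grid(`wt df` = c(1, 3), `hp df` = c(1, 3))

# Fit every combination on every fold
tuned <- tune_grid(wf, resamples = folds, grid = df_grid)

collect_metrics(tuned)                     # all combinations and metrics
show_best(tuned, metric = "rmse", n = 1)   # the combination with the lowest rmse
```

Each combination in the grid is evaluated on all four folds, so the mean and std_err columns summarize four fits per row.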

Linear Regression Model with PCA

We can create another workflow to see how model performance compares using a different model. In this case we are going to perform something called Principal Component Analysis or PCA.

So what is PCA?

PCA is a widely used dimensionality reduction method (a form of unsupervised machine learning). It creates new variables that capture the most variation within the data, yet reduce the data down to just a number of principal components. It does so by transforming the data using an orthogonal linear transformation. In other words, it creates new variables that are linear combinations of the variables within the data. Importantly these new variables are orthogonal, meaning that the new variables have zero covariance. In simpler terms, we are expressing unique types of variation within the data as new variables.

Check out this video for more information.
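The zero covariance property is easy to check with a small base R sketch. Here we use prcomp() rather than step_pca(), a handful of mtcars columns rather than our predictors, and we center and scale the variables first; all of these are choices of this sketch, not of the case study.

```r
# A few numeric predictors from a built-in data set
predictors <- mtcars[, c("disp", "hp", "wt", "qsec")]

# Center and scale so that no variable dominates because of its units
pca <- prcomp(predictors, center = TRUE, scale. = TRUE)

# The new variables (principal component scores)
scores <- pca$x
head(scores)

# The covariance between different components is zero (up to rounding)
round(cov(scores), 10)

# Share of the total variation captured by each component
summary(pca)$importance["Proportion of Variance", ]
```

Note that PCA is sensitive to the scale of the variables: without centering and scaling, a variable measured in very large units can dominate the first components.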

Let’s take a look to see what the step_pca function does to our predictors. To do so recall that we need to use the prep and juice functions of the recipes package on our recipe.

oper 1 step dummy [training] 
oper 2 step pca [training] 
The retained training set is ~ 0.13 Mb  in memory.
Rows: 584
Columns: 8
$ id    <fct> 1003.001, 1027.0001, 1033.1002, 1055.001, 1069.0003, 1073.0023,…
$ value <dbl> 9.597647, 10.800000, 11.212174, 12.375394, 10.508850, 15.591017…
$ fips  <fct> 1003, 1027, 1033, 1055, 1069, 1073, 1073, 1073, 1073, 1073, 107…
$ PC1   <dbl> -4120232815, -1570558844, -1534934869, -1388102571, -1504354268…
$ PC2   <dbl> -118909004, -346706197, 10140574, -129802253, -136385028, 23433…
$ PC3   <dbl> -72827.994, 7132.812, -56957.316, 45519.861, 36487.258, 448189.…
$ PC4   <dbl> -20385.63273, 1615.38773, -7187.37037, -15876.58640, -25852.952…
$ PC5   <dbl> -846.027447, 90.287809, 171.426032, -576.131718, -926.823714, -…

We still want to use the lm package for our regression so we can use the same model object as before:

Linear Regression Model Specification (regression)

Computational engine: lm 
══ Workflow ══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()

── Preprocessor ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
2 Recipe Steps

● step_dummy()
● step_pca()

── Model ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Linear Regression Model Specification (regression)

Computational engine: lm 

Remember that using workflows we don’t actually need to prep our recipe, we can just fit our model directly.

Fit the cross validation samples:

Look at the performance:

# A tibble: 2 x 5
  .metric .estimator   mean     n std_err
  <chr>   <chr>       <dbl> <int>   <dbl>
1 rmse    standard   2.76       4  0.271 
2 rsq     standard   0.0527     4  0.0184

And we can compare this with our previous performance:

# A tibble: 2 x 5
  .metric .estimator  mean     n std_err
  <chr>   <chr>      <dbl> <int>   <dbl>
1 rmse    standard   2.18      4  0.0763
2 rsq     standard   0.361     4  0.0486

So we can see that our performance isn’t as good: the rmse value increases from 2.18 to 2.76 and the rsq value drops from 0.361 to 0.0527.

Random Forest

Now for one last recipe, we are going to predict using a decision tree method called random forest.

A decision tree is a tool to partition data or anything really, based on a series of sequential (often binary) decisions, where the decisions are chosen based on their ability to optimally split the data.

Here you can see a simple example:

source

In the case of random forest, multiple decision trees are created - hence the name forest, and each tree is built using a random subset of the training data (with replacement) - hence the full name random forest. This random aspect helps to keep the algorithm from overfitting the data.

The mean of the predictions from each of the trees is used in the final output.

In our case, the random forest algorithm that we are working with does not work well when there are categorical variables with more than 53 levels, so we will need to remove the zcta variable.

The rand_forest() function of the parsnip package has three important arguments that act as an interface for the different possible engines to perform a random forest analysis:

  1. mtry - The number of predictor or explanatory variables that will be randomly sampled as options at each split when creating the tree models. The default number for regression analyses is the number of predictors divided by 3.

  2. min_n - The minimum number of data points in a node that are required for the node to be split further.

  3. trees - The number of trees in the ensemble.

Here we set mtry to 10 and min_n to 3.
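A model specification like the one printed below could be written as follows. This is a sketch that assumes the parsnip package is installed; the object name rf_spec is hypothetical, and the argument values are taken from the printed output in this section.

```r
library(parsnip)

# Specify a random forest with 10 predictors sampled at each split and
# at least 3 data points required in a node for it to be split further
rf_spec <- rand_forest(mtry = 10, min_n = 3)

# Choose the package that will do the fitting and the type of model
rf_spec <- set_engine(rf_spec, "randomForest")
rf_spec <- set_mode(rf_spec, "regression")

rf_spec
```

Because trees is not set here, the engine default is used, which is consistent with the 500 trees reported in the fitted model output.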

Random Forest Model Specification (regression)

Main Arguments:
  mtry = 10
  min_n = 3
Random Forest Model Specification (regression)

Main Arguments:
  mtry = 10
  min_n = 3

Computational engine: randomForest 
══ Workflow ══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: rand_forest()

── Preprocessor ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
7 Recipe Steps

● step_string2factor()
● step_rm()
● step_rm()
● step_rm()
● step_rm()
● step_corr()
● step_nzv()

── Model ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Random Forest Model Specification (regression)

Main Arguments:
  mtry = 10
  min_n = 3

Computational engine: randomForest 

Fitting the data with just parsnip and with the workflow:

══ Workflow [trained] ════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: rand_forest()

── Preprocessor ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
7 Recipe Steps

● step_string2factor()
● step_rm()
● step_rm()
● step_rm()
● step_rm()
● step_corr()
● step_nzv()

── Model ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

Call:
 randomForest(x = as.data.frame(x), y = y, mtry = ~10, nodesize = ~3) 
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 10

          Mean of squared residuals: 3.016711
                    % Var explained: 58.14

Let’s take a look at the top 10 contributing variables:

Interesting! The CMAQ values were also important in the previous model; however, in that model the variable indicating whether or not the monitor was located in California was also very predictive.

Now let’s take a look at model performance by fitting the data using cross validation:

# A tibble: 2 x 5
  .metric .estimator  mean     n std_err
  <chr>   <chr>      <dbl> <int>   <dbl>
1 rmse    standard   1.79      4  0.105 
2 rsq     standard   0.580     4  0.0377
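To see what the mean, n, and std_err columns summarize, here is a base R sketch of 4-fold cross validation. It is not the case study's code: it uses the built-in mtcars data and a linear model as stand-ins, and it assumes that the standard error is the standard deviation of the per-fold estimates divided by the square root of the number of folds.

```r
set.seed(1234)
n_folds <- 4

# Randomly assign each row to one of the 4 folds
fold_id <- sample(rep(seq_len(n_folds), length.out = nrow(mtcars)))

# For each fold: fit on the other folds, then compute rmse on the held-out fold
fold_rmse <- vapply(seq_len(n_folds), function(k) {
  analysis   <- mtcars[fold_id != k, ]
  assessment <- mtcars[fold_id == k, ]
  fit  <- lm(mpg ~ wt + hp, data = analysis)
  pred <- predict(fit, newdata = assessment)
  sqrt(mean((assessment$mpg - pred)^2))
}, numeric(1))

# Summarize across folds in the same shape as the table above
cv_summary <- data.frame(
  .metric = "rmse",
  mean    = mean(fold_rmse),
  n       = n_folds,
  std_err = sd(fold_rmse) / sqrt(n_folds)
)
cv_summary
```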

Now let’s compare the performance of this model with the others:

# A tibble: 2 x 5
  .metric .estimator  mean     n std_err
  <chr>   <chr>      <dbl> <int>   <dbl>
1 rmse    standard   2.18      4  0.0763
2 rsq     standard   0.361     4  0.0486
# A tibble: 1 x 7
  `lon df` `lat df` .metric .estimator  mean     n std_err
     <dbl>    <dbl> <chr>   <chr>      <dbl> <int>   <dbl>
1        5        5 rmse    standard    2.02     4  0.0717
# A tibble: 2 x 5
  .metric .estimator   mean     n std_err
  <chr>   <chr>       <dbl> <int>   <dbl>
1 rmse    standard   2.76       4  0.271 
2 rsq     standard   0.0527     4  0.0184

OK, so our first model had a mean rmse value of 2.18. The model with the lat/long degrees of freedom tuning had a mean rmse value of 2.02, thus showing some improvement. The PCA model had a mean rmse value of 2.76.

It looks like the random forest model had the lowest rmse value of 1.79.

If we tuned our random forest model based on the number of trees or the value for mtry (which is “The number of predictors that will be randomly sampled at each split when creating the tree models”), we might get a model with even better performance.

However, our cross validated mean rmse value of 1.79 is quite good because our range of true outcome values is much larger: 3.4963303 to 22.259123.
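One rough way to put the rmse in context is to express it as a fraction of the range of the outcome. The variable names below are hypothetical; the numbers are the ones quoted above.

```r
# Cross validated mean rmse of the random forest model
rmse_cv <- 1.79

# Minimum and maximum of the true outcome values
outcome_range <- c(3.4963303, 22.259123)

# rmse as a fraction of the width of the outcome range
rmse_fraction <- rmse_cv / diff(outcome_range)
rmse_fraction
```

The result is a little under 0.1, so the typical error is less than a tenth of the spread of the monitor values.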

Final Model Performance Evaluation

Now that we have decided that we have reasonable performance with the training data, we can stop building models and use the yardstick package (and the tune package, if we used workflows to fit our model) to evaluate performance with our testing data.

So now we will use our random forest model to predict values for the monitors in the testing data.

Using parsnip alone, we would need to use the baked testing data. With the workflows package, we could use the raw testing data.

Importantly, with parsnip alone, ID variables are not dealt with as nicely as with the workflows package, so we would need to remove them. We also did this above when we created the processed training data for this model, the juiced_train data.

# A tibble: 292 x 1
   .pred
   <dbl>
 1 11.1 
 2 11.8 
 3 12.0 
 4 11.3 
 5 11.8 
 6 11.9 
 7 10.6 
 8 10.6 
 9  8.65
10  7.60
# … with 282 more rows
# A tibble: 292 x 5
   .pred value fips  county     id       
   <dbl> <dbl> <fct> <chr>      <fct>    
 1 11.1  11.7  1049  DeKalb     1049.1003
 2 12.0  13.1  1073  Jefferson  1073.101 
 3 11.6  12.2  1073  Jefferson  1073.2006
 4 11.3  12.2  1089  Madison    1089.0014
 5 11.5  11.4  1103  Morgan     1103.0011
 6 12.0  12.2  1121  Talladega  1121.0002
 7 10.8  10.9  4013  Maricopa   4013.4003
 8 10.7  10.6  4021  Pinal      4021.0001
 9  8.60 14.1  4023  Santa Cruz 4023.0004
10  7.59  5.83 4025  Yavapai    4025.2002
# … with 282 more rows
# A tibble: 3 x 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard       1.49 
2 rsq     standard       0.612
3 mae     standard       1.06 

Awesome! We can see that our rmse of 1.49 with the testing data is quite similar to (and even slightly lower than) the cross validated rmse of 1.79 from the training data. We achieved quite good performance, which suggests that we could predict other locations with more sparse monitoring based on our predictors with reasonable accuracy.
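For reference, the three metrics reported above can be computed by hand. The sketch below uses only the ten rows of predictions and true values printed above (not the full 292-row test set), so its results will differ from the reported values. It assumes that rsq is the squared correlation between the true and predicted values, which is our understanding of how yardstick defines it.

```r
# First ten predictions and true values, copied from the table above
estimate <- c(11.1, 12.0, 11.6, 11.3, 11.5, 12.0, 10.8, 10.7, 8.60, 7.59)
truth    <- c(11.7, 13.1, 12.2, 12.2, 11.4, 12.2, 10.9, 10.6, 14.1, 5.83)

# Root mean squared error: penalizes large errors more heavily
rmse <- sqrt(mean((truth - estimate)^2))

# Mean absolute error: the average size of the errors
mae <- mean(abs(truth - estimate))

# R squared, taken here as the squared correlation
rsq <- cor(truth, estimate)^2

c(rmse = rmse, rsq = rsq, mae = mae)
```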

We could also use the last_fit() function of the tune package to look at performance if we chose to create a workflow using the workflows package.

# A tibble: 2 x 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard       1.49 
2 rsq     standard       0.607

We could check out test performance using the collect_metrics() function of the tune package.

# A tibble: 2 x 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard       1.49 
2 rsq     standard       0.607

Here you can see the predictions for the test set (the 292 rows with predictions out of the 876 original monitor values) also using the tune package with the collect_predictions() function.

# A tibble: 292 x 4
    id               .pred  .row value
    <chr>            <dbl> <int> <dbl>
  1 train/test split 11.1      4 11.7 
  2 train/test split 11.9     10 13.1 
  3 train/test split 11.8     12 12.2 
  4 train/test split 11.4     15 12.2 
  5 train/test split 11.5     19 11.4 
  6 train/test split 12.1     22 12.2 
  7 train/test split 10.6     30 10.9 
  8 train/test split 10.7     36 10.6 
  9 train/test split  8.54    39 14.1 
 10 train/test split  7.46    40  5.83
 11 train/test split  9.87    41  9.93
 12 train/test split 10.9     43 10.7 
 13 train/test split 10.2     49 10.5 
 14 train/test split 12.0     58 11.6 
 15 train/test split 13.3     59  9.54
 16 train/test split 12.2     60  9.54
 17 train/test split 11.1     63 11.9 
 18 train/test split 13.7     66 22.3 
 19 train/test split 10.5     70 10.2 
 20 train/test split 10.6     71  8.23
 21 train/test split  9.31    77  7.09
 22 train/test split 14.3     79 17.1 
 23 train/test split 13.5     83 16.1 
 24 train/test split 12.8     84 11.8 
 25 train/test split 13.5     86 15.0 
 26 train/test split 12.8     88 14.2 
 27 train/test split 11.3     97 10.4 
 28 train/test split 12.4     98 11.2 
 29 train/test split  9.50   100 14.2 
 30 train/test split 12.0    101 13.3 
 31 train/test split 10.1    102  8.25
 32 train/test split  8.37   103  6.98
 33 train/test split 14.0    104 16.4 
 34 train/test split 13.4    105 18.2 
 35 train/test split 12.6    107 12.0 
 36 train/test split 13.0    108 13.3 
 37 train/test split 11.1    116 13.4 
 38 train/test split  8.40   123  7.61
 39 train/test split  8.33   126 10.4 
 40 train/test split  9.29   131 15.3 
 41 train/test split 13.1    136 13.0 
 42 train/test split 10.1    138 10.4 
 43 train/test split  9.18   141 10.0 
 44 train/test split 12.5    142 10.6 
 45 train/test split  8.55   145  7.69
 46 train/test split  8.53   147  7.99
 47 train/test split  8.00   155  7.34
 48 train/test split 11.1    160 11.8 
 49 train/test split 10.0    163 10.5 
 50 train/test split 11.4    166 11.5 
 51 train/test split 11.4    169 10.6 
 52 train/test split 10.2    170 11.6 
 53 train/test split 10.7    171 10.1 
 54 train/test split 11.8    178 12.2 
 55 train/test split 12.0    181 11.7 
 56 train/test split  9.63   182  7.26
 57 train/test split 10.6    184  9.91
 58 train/test split  8.76   188  8.13
 59 train/test split 10.2    189  7.82
 60 train/test split  9.45   194  8.23
 61 train/test split  9.02   195  7.02
 62 train/test split  8.21   203  6.08
 63 train/test split  9.12   205  8.08
 64 train/test split  9.94   207  8.10
 65 train/test split  9.38   208  7.55
 66 train/test split  9.79   210  8.00
 67 train/test split 12.5    211 12.3 
 68 train/test split 11.6    212 12.2 
 69 train/test split 11.8    214 11.5 
 70 train/test split 11.4    215 12.3 
 71 train/test split 13.0    216 13.8 
 72 train/test split 12.0    218 13.1 
 73 train/test split 11.8    221 12.9 
 74 train/test split 11.6    227 11.9 
 75 train/test split 11.7    228 11.7 
 76 train/test split 11.9    231 12.6 
 77 train/test split 12.5    234 13.7 
 78 train/test split  8.59   243  6.25
 79 train/test split 10.4    248 11.0 
 80 train/test split 12.1    249 12.6 
 81 train/test split 11.8    250 11.8 
 82 train/test split 12.4    251 12.2 
 83 train/test split 12.3    254 12.9 
 84 train/test split 12.0    258 11.3 
 85 train/test split 12.4    260 13.2 
 86 train/test split 10.7    262 12.4 
 87 train/test split 11.0    265 10.4 
 88 train/test split 12.2    270 11.9 
 89 train/test split 11.5    276 10.4 
 90 train/test split 11.6    280 11.1 
 91 train/test split 12.2    281 11.7 
 92 train/test split 12.1    284 11.7 
 93 train/test split 12.7    291 12.6 
 94 train/test split 11.3    293 11.4 
 95 train/test split 12.0    294 11.5 
 96 train/test split 12.3    296 11.9 
 97 train/test split 12.5    297 12.5 
 98 train/test split 12.3    298 13.9 
 99 train/test split 11.7    303 11.2 
100 train/test split 13.1    305 15.1 
101 train/test split 12.6    307 13.1 
102 train/test split 12.8    309 13.3 
103 train/test split 11.4    313 11.8 
104 train/test split 12.1    319 11.9 
105 train/test split 10.4    326 10.3 
106 train/test split  9.34   328  8.81
107 train/test split 11.9    335 11.2 
108 train/test split  8.68   338  9.85
109 train/test split 10.5    343  9.63
110 train/test split 10.6    344  9.47
111 train/test split 10.5    345  9.39
112 train/test split 11.0    350 12.5 
113 train/test split 12.2    358 12.1 
114 train/test split 12.0    361 11.7 
115 train/test split 13.5    362 13.2 
116 train/test split 13.7    363 13.4 
117 train/test split 12.4    365 12.7 
118 train/test split 11.4    371 12.5 
119 train/test split 11.8    375 11.4 
120 train/test split 10.3    378  9.57
121 train/test split 10.6    379  9.15
122 train/test split 11.4    380  9.28
123 train/test split 10.2    382  9.23
124 train/test split  8.08   389  5.58
125 train/test split 12.9    390 12.7 
126 train/test split 11.5    395  9.64
127 train/test split 11.8    397 12.2 
128 train/test split 12.4    400 12.2 
129 train/test split 13.0    403 14.3 
130 train/test split  9.41   405  9.96
131 train/test split 10.0    413  8.70
132 train/test split 11.3    417  9.82
133 train/test split 10.7    421  9.64
134 train/test split 10.0    424 10.8 
135 train/test split 11.0    427 10.1 
136 train/test split 11.7    428  9.87
137 train/test split 11.3    429 11.2 
138 train/test split 11.4    430 11.1 
139 train/test split 11.3    431 10.6 
140 train/test split 10.5    432  9.93
141 train/test split  8.66   434  7.61
142 train/test split 11.3    440 11.1 
143 train/test split 11.3    441 11.4 
144 train/test split 12.3    442 11.9 
145 train/test split 12.5    444 11.9 
146 train/test split 12.6    449 11.8 
147 train/test split 12.6    450 12.3 
148 train/test split 10.5    456 10.1 
149 train/test split  8.14   457  7.01
150 train/test split 10.8    460 11.1 
151 train/test split 10.9    461 10.9 
152 train/test split  8.47   463  6.70
153 train/test split 10.5    468 10.7 
154 train/test split 11.0    469 11.8 
155 train/test split  9.98   471 13.0 
156 train/test split 11.1    474 12.1 
157 train/test split 10.3    475 10.1 
158 train/test split 11.1    476 13.1 
159 train/test split 11.3    478 11.8 
160 train/test split 11.6    479 11.6 
161 train/test split 10.3    480 12.0 
162 train/test split  9.70   482 10.1 
163 train/test split 10.4    484 10.8 
164 train/test split 11.8    485 12.5 
165 train/test split 11.7    486 11.4 
166 train/test split 12.2    488 12.2 
167 train/test split 12.7    491 12.7 
168 train/test split  8.35   495  9.26
169 train/test split  6.84   497  5.34
170 train/test split  6.46   499  6.89
171 train/test split  7.49   501  7.32
172 train/test split  7.83   508  7.26
173 train/test split 10.3    510  8.83
174 train/test split  8.83   511  7.69
175 train/test split 10.1    512  7.87
176 train/test split 11.0    518  9.01
177 train/test split  7.75   522  6.68
178 train/test split 12.4    529 11.5 
179 train/test split 12.6    534 11.9 
180 train/test split 12.6    538 13.2 
181 train/test split 11.1    540 10.0 
182 train/test split 11.8    541 10.9 
183 train/test split 11.3    542  9.43
184 train/test split 10.1    543  8.84
185 train/test split 12.3    548 11.9 
186 train/test split  7.08   552  6.22
187 train/test split 10.4    561  8.23
188 train/test split 12.2    564 11.7 
189 train/test split 12.1    569 11.9 
190 train/test split 12.2    573 13.2 
191 train/test split 11.3    574  9.67
192 train/test split 12.2    577 11.0 
193 train/test split 12.3    578 12.0 
194 train/test split 10.8    585  9.31
195 train/test split 11.6    587 12.7 
196 train/test split 10.9    588 11.1 
197 train/test split 11.5    592 11.4 
198 train/test split 11.4    595 12.3 
199 train/test split 12.2    596 12.4 
200 train/test split 10.9    597 11.6 
201 train/test split 11.4    598 12.5 
202 train/test split  9.68   599 11.7 
203 train/test split 10.1    600 10.5 
204 train/test split 11.6    601 10.5 
205 train/test split 10.8    602 11.6 
206 train/test split 11.9    605 13.1 
207 train/test split 10.4    607  9.86
208 train/test split 11.3    608 11.0 
209 train/test split 11.9    613 12.9 
210 train/test split 10.9    623 10.7 
211 train/test split 11.8    627 11.7 
212 train/test split 12.6    629 10.8 
213 train/test split 12.9    634 11.9 
214 train/test split 11.2    638 11.6 
215 train/test split 13.3    642 14.4 
216 train/test split 12.3    643 13.3 
217 train/test split 12.9    645 14.5 
218 train/test split 13.1    646 14.5 
219 train/test split 12.4    649 13.2 
220 train/test split 11.5    658 12.2 
221 train/test split 11.6    659 12.0 
222 train/test split 11.7    660 12.0 
223 train/test split 12.4    661 13.4 
224 train/test split 11.8    665 13.2 
225 train/test split 12.2    666 12.0 
226 train/test split  8.80   667  9.68
227 train/test split 11.2    670 10.3 
228 train/test split 11.3    675 11.7 
229 train/test split 11.1    676 11.4 
230 train/test split  8.05   678 11.7 
231 train/test split  8.45   682 13.3 
232 train/test split  8.52   687 13.1 
233 train/test split 12.6    696 17.1 
234 train/test split 11.9    697 10.8 
235 train/test split 12.5    701 13.9 
236 train/test split 11.0    707 13.0 
237 train/test split 12.4    708 13.2 
238 train/test split 13.0    719 13.4 
239 train/test split  9.69   726  6.61
240 train/test split 10.1    727  9.24
241 train/test split 11.2    733 11.1 
242 train/test split 12.1    736 12.6 
243 train/test split 10.9    737 12.2 
244 train/test split  9.67   740  9.10
245 train/test split 10.7    743 11.8 
246 train/test split 11.6    744 12.0 
247 train/test split  8.24   747  9.59
248 train/test split 12.3    754 12.3 
249 train/test split  9.21   757  9.95
250 train/test split  9.99   758  8.71
251 train/test split  5.85   760  5.68
252 train/test split 11.9    763 10.8 
253 train/test split 10.5    773 10.2 
254 train/test split 10.3    775 12.0 
255 train/test split 10.6    776 11.0 
256 train/test split 10.5    777  8.94
257 train/test split 11.6    780 10.3 
258 train/test split  9.26   782  8.87
259 train/test split  7.89   784  8.20
260 train/test split  8.88   787 10.6 
261 train/test split  8.94   794  7.90
262 train/test split  8.95   795  8.23
263 train/test split  9.23   796  9.86
264 train/test split  9.03   798  7.33
265 train/test split 12.2    803 12.0 
266 train/test split 12.1    806 11.1 
267 train/test split 11.9    808 11.8 
268 train/test split 11.9    810 10.8 
269 train/test split 11.1    811 10.5 
270 train/test split 12.0    816 11.6 
271 train/test split 10.7    817 10.2 
272 train/test split  9.33   822  8.39
273 train/test split  8.97   826  7.35
274 train/test split  9.26   827  8.40
275 train/test split 12.7    832 14.2 
276 train/test split 12.6    833 14.3 
277 train/test split 11.4    835 12.6 
278 train/test split 11.7    840 12.6 
279 train/test split 11.8    841 12.8 
280 train/test split  7.53   844  6.38
281 train/test split 10.5    846 12.4 
282 train/test split 10.4    847 10.6 
283 train/test split  7.87   848  6.96
284 train/test split  9.94   849 12.7 
285 train/test split 12.6    853 12.8 
286 train/test split 10.2    860 10.8 
287 train/test split  7.49   862 10.2 
288 train/test split  8.02   865  5.25
289 train/test split  6.38   866  5.85
290 train/test split  6.63   868  3.50
291 train/test split  6.98   871  7.05
292 train/test split  6.02   872  7.82
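The last_fit(), collect_metrics(), and collect_predictions() steps described above can be sketched end to end as follows. This is not the case study's code: it assumes the tidymodels and randomForest packages are installed, uses the built-in mtcars data as a stand-in for the monitor data, and all object names are hypothetical.

```r
library(tidymodels)

set.seed(1234)

# Split the data into training and testing sets
car_split <- initial_split(mtcars, prop = 2/3)

# A minimal recipe: predict mpg from all other variables
car_recipe <- recipe(mpg ~ ., data = training(car_split))

# Random forest specification using the randomForest engine
rf_spec <- rand_forest(mtry = 3, min_n = 3)
rf_spec <- set_engine(rf_spec, "randomForest")
rf_spec <- set_mode(rf_spec, "regression")

# Bundle the recipe and the model into a workflow
rf_wflow <- workflow()
rf_wflow <- add_recipe(rf_wflow, car_recipe)
rf_wflow <- add_model(rf_wflow, rf_spec)

# Fit on the training set and evaluate on the testing set in one step
final_fit <- last_fit(rf_wflow, split = car_split)

test_metrics <- collect_metrics(final_fit)      # rmse and rsq on the test set
test_preds   <- collect_predictions(final_fit)  # one row per test observation

test_metrics
head(test_preds)
```

As in the table above, the predictions come back with a .pred column, a .row column pointing to the row in the original data, and the true outcome values.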

Data Visualization

Our main question for this case study was:

  1. Can we predict annual average air pollution concentrations at the granularity of zip code regional levels using predictors such as data about population density, urbanization, road density, as well as, satellite pollution data and chemical modeling data?

We have indeed created a model that can predict fine particulate matter air pollution levels based on our predictor variables.

Now let’s make a plot of our predicted values and the true values.

First, let’s start by making a plot of our monitors:

We will use the following packages to create a map of the US: 1) sf 2) maps 3) rnaturalearth 4) rgeos

According to this link on Wikipedia, these are the latitude and longitude bounds of the continental US.

top = 49.3457868 # north lat
left = -124.7844079 # west long
right = -66.9513812 # east long
bottom = 24.7433195 # south lat

We will start with getting an outline of the US with the ne_countries() function of the rnaturalearth package.

Now let’s add county lines.

County graphical data is available from the maps package. The sf package, which is short for simple features, creates a data frame from this graphical data so that we can work with it.
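The conversion described above can be sketched like this, assuming the sf and maps packages are installed. The object name counties is hypothetical.

```r
library(sf)
library(maps)

# Get the county outlines from the maps package without plotting them,
# then convert them to a simple features data frame
counties <- st_as_sf(maps::map("county", plot = FALSE, fill = TRUE))

head(counties)
```

Each row is one county, with an ID column of the form "state,county" and a geometry column holding the polygon outline.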

Now let’s add a fill at the county level for the true monitor values of air pollution:

Simple feature collection with 6 features and 1 field
geometry type:  MULTIPOLYGON
dimension:      XY
bbox:           xmin: -88.01778 ymin: 30.24071 xmax: -85.06131 ymax: 34.2686
CRS:            EPSG:4326
               ID                           geom
1 alabama,autauga MULTIPOLYGON (((-86.50517 3...
2 alabama,baldwin MULTIPOLYGON (((-87.93757 3...
3 alabama,barbour MULTIPOLYGON (((-85.42801 3...
4    alabama,bibb MULTIPOLYGON (((-87.02083 3...
5  alabama,blount MULTIPOLYGON (((-86.9578 33...
6 alabama,bullock MULTIPOLYGON (((-85.66866 3...

Now let’s do the same with our predicted values.

Let’s grab both the testing and training fitted values so that we have as much data as possible. In this case, the output structure for the training data fit is slightly different using randomForest. The fitted values are called predicted and the broom functions like tidy() and augment() will not work. So we will manually grab the fitted training data values.

# A tibble: 292 x 5
   .pred value fips  county     id       
   <dbl> <dbl> <fct> <chr>      <fct>    
 1 11.1  11.7  1049  DeKalb     1049.1003
 2 12.0  13.1  1073  Jefferson  1073.101 
 3 11.6  12.2  1073  Jefferson  1073.2006
 4 11.3  12.2  1089  Madison    1089.0014
 5 11.5  11.4  1103  Morgan     1103.0011
 6 12.0  12.2  1121  Talladega  1121.0002
 7 10.8  10.9  4013  Maricopa   4013.4003
 8 10.7  10.6  4021  Pinal      4021.0001
 9  8.60 14.1  4023  Santa Cruz 4023.0004
10  7.59  5.83 4025  Yavapai    4025.2002
# … with 282 more rows

Summary

Let’s review everything:

We have explored gravimetric monitoring data of fine particulate matter air pollution. We have utilized the tidymodels ecosystem of packages to predict monitor values using a variety of predictors, also known as explanatory variables, including satellite data, road density data, and population density, among others. Our model could now be extended to predict pollution levels in areas with poor monitoring, to help identify regions where populations may be especially at risk for the health effects of air pollution.

We learned that there are two major types of what is called supervised machine learning: prediction and classification. We learned that prediction is used when the outcome variable is numeric and classification is performed when the outcome variable is categorical.

We performed the major steps of machine learning that we introduced in the beginning of the data analysis:

  1. Data exploration

We used packages like skimr, summarytools, corrplot, ggcorrplot, and GGally to better understand our data. These packages can tell us how many missing values each variable has (if any), the class of each variable, the distribution of values for each variable, the sparsity of each variable, and the level of correlation between variables.

  2. Data splitting

We used the rsample package to first perform an initial split of our data into two pieces: a training set and a testing set. The training set was used to optimize the model, while the testing set was used only to evaluate the performance of our final model. We also used the rsample package to create cross validation subsets of our training data. This allowed us to better assess the performance of our tested models using our training data.

  3. Variable assignment and preprocessing

We used the recipes package to assign variable roles (such as outcome, predictor, and id variable). We also used this package to create a recipe for preprocessing our training and testing data. This involved steps such as: step_dummy to create dummy numeric encodings of our categorical variables, step_corr to remove highly correlated variables, and step_nzv to remove near zero variance variables that would contribute little to our model and potentially add noise. We learned that once our recipe was created and prepped using prep(), we could extract the pre-processed training data using juice() or our pre-processed testing data using bake(). We also learned that if we used the newer workflows package, we did not need to use the prep(), juice(), or bake() functions, but that it is still useful to know how to do so if we want to look more deeply at our data and how the recipe is influencing it.

  4. Model specification, fitting, tuning and performance evaluation using the training data

We learned that the model needs to first be fit to the training data. We learned that in both classification and prediction, the model is fit to the training data and the explanatory variables are used to estimate numeric values (in the case of prediction) or categorical values (in the case of classification) of the outcome variable of interest. We learned that we specify the model and its specifications using the parsnip package and that we also use this package to fit the model using the fit() function. We learned that if we just use parsnip to fit the model, then we need to use the pre-processed training data (output from juice()). We learned that we can use the raw training data if we use the workflows package to create a workflow that pre-processes our data for us.

We learned that if the model fits well then the estimated values will be very similar to the true outcome variable values in our training data. We learned that we can assess model performance using the yardstick package and the metrics() function. We also learned that we can use subsets of our training data (which we created with the rsample package) to perform cross validation to get a better estimate of the performance of our model using our training data, as we want our results to be generalizable and to perform well with other data, not just our training data. We used the fit_resamples() function of the tune package to fit our model on our different training data subsets and the collect_metrics() function (also of the tune package) to evaluate model performance using these subsets. We also learned that we can potentially improve model performance by tuning aspects of the model called hyper-parameters to determine the best option for model performance. We learned that we can do this using the tune and dials packages and evaluating the performance of our model with the different hyper-parameter options and our training data subsets that we used for cross validation. After we tested several different methods to model our data, we compared them to choose the best performing model as our final model.

  5. Overall model performance evaluation

Once we chose our final model, we evaluated the final model performance using the testing data. This gives us a better estimate about how well the model will predict or classify the outcome variable of interest with new independent data. Ideally one would also perform an evaluation with independent data to provide a sense of how generalizable the model is to other data sources.

We first used our final model to predict values for our testing data, using either just parsnip and the pre-processed testing data (created with the bake() function of the recipes package), or our raw testing data if we used a workflow. We used the same performance evaluation functions: yardstick::metrics(), and tune::collect_metrics() when using cross validation. We also learned how we can use the last_fit() function of the tune package if we used a workflow to get the test data performance using the initial data and the testing/training split information.

Analyses like the one in our case study are important for defining which groups could benefit the most from interventions, education, and policy changes when attempting to mitigate public health challenges. You can see in this article that many additional considerations would be involved to adequately understand the data enough to recommend policy changes.

Suggested Homework

Students can predict air pollution monitor values using a different algorithm and provide an explanation for how that algorithm works and why it may be a good choice for modeling this data.

Session info


R version 4.0.1 (2020-06-06)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Mojave 10.14.5

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] cowplot_1.0.0       rgeos_0.5-3         sp_1.4-2           
 [4] rnaturalearth_0.1.0 maps_3.3.0          sf_0.9-3           
 [7] lwgeom_0.2-4        tidyr_1.1.0         stringr_1.4.0      
[10] randomForest_4.6-14 vip_0.2.2           yardstick_0.0.6    
[13] workflows_0.1.1     tune_0.1.0          tibble_3.0.1       
[16] rsample_0.0.7       recipes_0.1.12      purrr_0.3.4        
[19] parsnip_0.1.1       infer_0.5.1         dials_0.0.7        
[22] scales_1.1.1        broom_0.5.6         tidymodels_0.1.0   
[25] GGally_2.0.0        ggcorrplot_0.1.3    ggplot2_3.3.1      
[28] RColorBrewer_1.1-2  corrplot_0.84       magrittr_1.5       
[31] summarytools_0.9.6  skimr_2.1.1         dplyr_1.0.0        
[34] readr_1.3.1         knitr_1.28          here_0.1           

loaded via a namespace (and not attached):
  [1] utf8_1.1.4              tidyselect_1.1.0        lme4_1.1-23            
  [4] htmlwidgets_1.5.1       grid_4.0.1              pROC_1.16.2            
  [7] munsell_0.5.0           codetools_0.2-16        units_0.6-6            
 [10] statmod_1.4.34          DT_0.13                 future_1.17.0          
 [13] miniUI_0.1.1.1          withr_2.2.0             colorspace_1.4-1       
 [16] highr_0.8               rstudioapi_0.11         stats4_4.0.1           
 [19] bayesplot_1.7.2         listenv_0.8.0           labeling_0.3           
 [22] rstan_2.19.3            repr_1.1.0              rnaturalearthdata_0.1.0
 [25] DiceDesign_1.8-1        farver_2.0.3            rprojroot_1.3-2        
 [28] vctrs_0.3.1             generics_0.0.2          ipred_0.9-9            
 [31] xfun_0.14               R6_2.4.1                markdown_1.1           
 [34] rstanarm_2.19.3         lhs_1.0.2               reshape_0.8.8          
 [37] assertthat_0.2.1        promises_1.1.1          nnet_7.3-14            
 [40] gtable_0.3.0            globals_0.12.5          processx_3.4.2         
 [43] timeDate_3043.102       rlang_0.4.6             splines_4.0.1          
 [46] rapportools_1.0         checkmate_2.0.0         inline_0.3.15          
 [49] yaml_2.2.1              reshape2_1.4.4          tidytext_0.2.4         
 [52] threejs_0.3.3           crosstalk_1.1.0.1       backports_1.1.7        
 [55] httpuv_1.5.4            rsconnect_0.8.16        tokenizers_0.2.1       
 [58] tools_4.0.1             lava_1.6.7              tcltk_4.0.1            
 [61] ellipsis_0.3.1          ggridges_0.5.2          Rcpp_1.0.4.6           
 [64] plyr_1.8.6              base64enc_0.1-3         classInt_0.4-3         
 [67] ps_1.3.3                prettyunits_1.1.1       rpart_4.1-15           
 [70] zoo_1.8-8               furrr_0.1.0             magick_2.3             
 [73] colourpicker_1.0        GPfit_1.0-8             SnowballC_0.7.0        
 [76] matrixStats_0.56.0      tidyposterior_0.0.3     hms_0.5.3              
 [79] shinyjs_1.1             mime_0.9                evaluate_0.14          
 [82] xtable_1.8-4            tidypredict_0.4.5       shinystan_2.5.0        
 [85] gridExtra_2.3           rstantools_2.1.0        compiler_4.0.1         
 [88] KernSmooth_2.23-17      crayon_1.3.4            minqa_1.2.4            
 [91] StanHeaders_2.21.0-3    htmltools_0.4.0         mgcv_1.8-31            
 [94] later_1.1.0.1           lubridate_1.7.8         DBI_1.1.0              
 [97] MASS_7.3-51.6           boot_1.3-25             Matrix_1.2-18          
[100] cli_2.0.2               pryr_0.1.4              parallel_4.0.1         
[103] gower_0.2.1             igraph_1.2.5            pkgconfig_2.0.3        
[106] foreach_1.5.0           dygraphs_1.1.1.6        hardhat_0.1.3          
[109] prodlim_2019.11.13      janeaustenr_0.1.5       callr_3.4.3            
[112] digest_0.6.25           rmarkdown_2.2           shiny_1.4.0.2          
[115] gtools_3.8.2            nloptr_1.2.2.1          lifecycle_0.2.0        
[118] nlme_3.1-148            jsonlite_1.6.1          viridisLite_0.3.0      
[121] fansi_0.4.1             pillar_1.4.4            lattice_0.20-41        
[124] loo_2.2.0               fastmap_1.0.1           pkgbuild_1.0.8         
[127] survival_3.1-12         glue_1.4.1              xts_0.12-0             
[130] shinythemes_1.1.2       iterators_1.0.12        pander_0.6.3           
[133] class_7.3-17            stringi_1.4.6           e1071_1.7-3            
---
title: "Open Case Studies : Predicting Annual Air Pollution "
css: style.css
output:
  html_document:
    self_contained: yes
    code_download: yes
    highlight: tango
    number_sections: no
    theme: cosmo
    toc: yes
    toc_float: yes
  pdf_document:
    toc: yes
  word_document:
    toc: yes

---
<style>
#TOC {
  background: url("https://opencasestudies.github.io/img/logo.jpg");
  background-size: contain;
  padding-top: 240px !important;
  background-repeat: no-repeat;
}
</style>




```{r setup, include=FALSE}
knitr::opts_chunk$set(include = TRUE, comment = NA, echo = TRUE,
                      message = FALSE, warning = FALSE, cache = FALSE,
                      fig.align = "center", out.width = '90%')
library(here)
library(knitr)
```


#### {.outline }
```{r, echo = FALSE, out.width = "800 px"}
knitr::include_graphics(here::here("img", "main_plot_maps.png"))
```

####

## {.disclaimer_block}

**Disclaimer**: The purpose of the [Open Case Studies](https://opencasestudies.github.io){target="_blank"} project is **to demonstrate the use of various data science methods, tools, and software in the context of messy, real-world data**. A given case study does not cover all aspects of the research process, is not claiming to be the most appropriate way to analyze a given data set, and should not be used in the context of making policy decisions without external consultation from scientific experts. 

## Motivation
A variety of different sources contribute different types of pollutants to what we call air pollution. 
Some sources are natural while others are anthropogenic (human derived):

<p align="center">
  <img width="600" src="https://www.nps.gov/subjects/air/images/Sources_Graphic_Huge.jpg?maxwidth=1200&maxheight=1200&autorotate=false">
</p>

##### [[source](https://www.google.com/url?sa=i&url=https%3A%2F%2Fwww.nps.gov%2Fsubjects%2Fair%2Fsources.htm&psig=AOvVaw2v7AVxSF8ZSAPEhNudVtbN&ust=1585770966217000&source=images&cd=vfe&ved=0CAIQjRxqFwoTCPDN66q_xegCFQAAAAAdAAAAABAD)]{target="_blank"}

#### Major types of air pollutants

1) **Gaseous** - Carbon Monoxide (CO), Ozone (O~3~), Nitrogen Oxides (NO, NO~2~), Sulfur Dioxide (SO~2~)
2) **Particulate** - small liquids and solids suspended in the air (includes lead and can include certain types of dust)
3) **Dust** - small solids (larger than particulates) that can be suspended in the air for some time but eventually settle
4) **Biological** - pollen, bacteria, viruses, mold spores

See [here](http://www.redlogenv.com/worker-safety/part-1-dust-and-particulate-matter) for more detail on the types of pollutants in the air.


#### Particulate pollution 

Air pollution particulates are generally described by their **size**.

There are 3 major categories:

1) **Large Coarse** Particulate Matter - has a diameter of >10 micrometers (10 µm) 

2) **Coarse** Particulate Matter (called **PM~10-2.5~**) - has a diameter between 2.5 µm and 10 µm

3) **Fine** Particulate Matter (called **PM~2.5~**) - has a diameter of < 2.5 µm 

**PM~10~** includes any particulate matter <10 µm (both coarse and fine particulate matter)
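
As a minimal sketch of these size categories, the hypothetical helper below (`classify_pm()` is not part of the original analysis) labels particles by diameter. How a particle of exactly 2.5 µm or 10 µm is assigned is an assumption made here for illustration.

```{r}
# Hypothetical helper: label particles by diameter in micrometers.
# Assumption: intervals are closed on the left, so exactly 2.5 is "coarse"
# and exactly 10 is "large coarse".
classify_pm <- function(diameter_um) {
  cut(diameter_um,
      breaks = c(0, 2.5, 10, Inf),
      labels = c("fine (PM2.5)", "coarse (PM10-2.5)", "large coarse"),
      right = FALSE)
}

# Three made-up diameters, one in each category
classify_pm(c(1, 5, 50))
```

Both the fine and the coarse labels would count toward PM~10~, since PM~10~ covers everything below 10 µm.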

Here you can see how these sizes compare with a human hair:

```{r, echo = FALSE, out.width= "600 px"}
knitr::include_graphics(here::here("img", "pm2.5_scale_graphic-color_2.jpg"))
```

##### [[source](https://www.epa.gov/pm-pollution/particulate-matter-pm-basics)]{target="_blank"}

<!-- <p align="center"> -->
<!--   <img width="500" src="https://www.sensirion.com/images/sensirion-specialist-article-figure-1-cdd70.jpg"> -->
<!-- </p> -->


<u>The following plot and table show the relative sizes of these different pollutants in micrometers (µm):</u>

<p align="center">
  <img width="600" src="https://upload.wikimedia.org/wikipedia/commons/thumb/d/df/Airborne-particulate-size-chart.svg/800px-Airborne-particulate-size-chart.svg.png">
</p>

##### [[source](https://en.wikipedia.org/wiki/Particulates)]{target="_blank"}


<p align="center">
  <img width="500" src="https://www.frontiersin.org/files/Articles/505570/fpubh-08-00014-HTML/image_m/fpubh-08-00014-t002.jpg">
</p>

##### [[source](https://www.frontiersin.org/articles/10.3389/fpubh.2020.00014/full)]{target="_blank"}


<u>This table shows how deeply some of the smaller fine particles can penetrate within the human body:</u>

<p align="center">
  <img width="500" src="https://www.frontiersin.org/files/Articles/505570/fpubh-08-00014-HTML/image_m/fpubh-08-00014-t001.jpg">
</p>

##### [[source](https://www.frontiersin.org/articles/10.3389/fpubh.2020.00014/full)]{target="_blank"}


#### Negative Impact of Particulate Exposure on Health 

Exposure to air pollution is associated with higher rates of [mortality](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5783186/){target="_blank"} in older adults and is known to be a risk factor for many diseases and conditions including but not limited to:

1) [Asthma](https://www.ncbi.nlm.nih.gov/pubmed/29243937){target="_blank"} - fine particle exposure (**PM~2.5~**) was found to be associated with higher rates of asthma in children
2) [Inflammation in type 1 diabetes](https://www.ncbi.nlm.nih.gov/pubmed/31419765){target="_blank"} - fine particle exposure (**PM~2.5~**) from traffic-related air pollution was associated with increased measures of inflammatory markers in youths with type 1 diabetes
3) [Lung function and emphysema](https://www.ncbi.nlm.nih.gov/pubmed/31408135){target="_blank"} - higher concentrations of ozone (O~3~), nitrogen oxides (NO~x~), black carbon, and fine particles (**PM~2.5~**) at study baseline were significantly associated with greater increases in percent emphysema per 10 years 
4) [Low birthweight](https://www.ncbi.nlm.nih.gov/pubmed/31386643){target="_blank"} - fine particle exposure (**PM~2.5~**) was associated with lower birth weight in full-term live births
5) [Viral Infection](https://www.tandfonline.com/doi/full/10.1080/08958370701665434){target="_blank"} - higher rates of infection and increased severity of infection are associated with higher exposures to pollution levels including fine particle exposure (**PM~2.5~**)

See this [review article](https://www.frontiersin.org/articles/10.3389/fpubh.2020.00014/full){target="_blank"} for more information about sources of air pollution and the influence of air pollution on health.

#### Sparse Monitoring is Problematic for Public Health

Historically, epidemiological studies have assessed the influence of air pollution on health outcomes by relying on a number of monitors located around the country. However, as can be seen in the following figure, these monitors remain relatively sparse in certain regions of the country. Furthermore, dramatic differences in pollution levels can be seen even within the same city.

<p align="center">
  <img width="400" src="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4137272/bin/1476-069X-13-63-1.jpg">
</p>

##### [[source](https://ehjournal.biomedcentral.com/articles/10.1186/1476-069X-13-63)]{target="_blank"}

This lack of granularity in air pollution monitoring has hindered our ability to discern the full impact of air pollution on health and to identify at-risk locations. 


#### Machine Learning Offers a Solution

An [article](https://ehjournal.biomedcentral.com/articles/10.1186/1476-069X-13-63){target="_blank"} published in the *Environmental Health* journal dealt with this issue by using data about population density and road density, among other features, to model or predict air pollution levels at a more localized scale using machine learning methods. 

```{r, echo = FALSE, out.width= "800 px"}
knitr::include_graphics(here::here("img", "thepaper.png"))
```

#### {.reference_block}
Yanosky, J. D. et al. Spatio-temporal modeling of particulate air pollution in the conterminous United States using geographic and meteorological predictors. *Environ Health* 13, 63 (2014).

####

The authors of this article state that:

> "Exposure to atmospheric particulate matter (PM) remains an important public health concern,
although it remains difficult to quantify accurately across large geographic areas with sufficiently high spatial
resolution. Recent epidemiologic analyses have demonstrated the importance of spatially- and temporally-resolved
exposure estimates, which show larger PM-mediated health effects as compared to nearest monitor or
county-specific ambient concentrations." 


```{r, echo = FALSE, out.width= "700 px", eval = FALSE}
knitr::include_graphics(here::here("img", "deaths.png"))
```

The article above explains that machine learning methods can be used to predict air pollution levels when traditional monitoring systems are not available in a particular area or when there is not enough spatial granularity with current monitoring systems. We will use similar methods to predict annual air pollution levels spatially within the US.


### Main Questions

#### {.main_question_block}
<b><u> Our main question: </u></b>

1) Can we predict annual average air pollution concentrations at the granularity of zip code regional levels using predictors such as data about population density, urbanization, and road density, as well as satellite pollution data and chemical modeling data?

####

### Learning Objectives 

In this case study, we will walk you through importing data from CSV files and applying machine learning methods to predict our outcome variable of interest (in this case annual fine particle air pollution estimates). We will especially focus on using packages and functions from the [`Tidyverse`](https://www.tidyverse.org/){target="_blank"}, and more specifically the [`tidymodels`](https://cran.r-project.org/web/packages/tidymodels/tidymodels.pdf){target="_blank"} package/ecosystem primarily developed and maintained by [Max Kuhn](https://resources.rstudio.com/authors/max-kuhn){target="_blank"} and [Davis Vaughan](https://resources.rstudio.com/authors/davis-vaughan){target="_blank"}. This package loads additional modeling-related packages such as `rsample`, `recipes`, `parsnip`, `yardstick`, and `dials`. We will also briefly cover the `workflows` and `tune` packages. The tidyverse is a library of packages created by RStudio. While some students may be familiar with previous R programming packages, these packages make data science in R especially efficient.


```{r, out.width = "20%", echo = FALSE, fig.align ="center"}
include_graphics("https://tidyverse.tidyverse.org/logo.png")
```

```{r, out.width = "100px", echo = FALSE, fig.align ="center"}
include_graphics("https://pbs.twimg.com/media/DkBFpSsW4AIyyIN.png")
```


We will begin by loading the packages that we will need:

```{r}
library(here)
library(readr)
library(dplyr)
library(skimr)
library(summarytools)
library(magrittr)
library(corrplot)
library(RColorBrewer)
library(ggcorrplot)
library(GGally)
library(tidymodels) # loads broom, dials, infer, parsnip, purrr, recipes, rsample, tibble, yardstick
library(workflows)
library(vip)
library(tune)
library(randomForest)
library(ggplot2)
library(stringr)
library(tidyr)
library(lwgeom) # allows 1263
library(sf)
library(maps)
library(rnaturalearth)
library(rgeos)
library(cowplot)
```


 Package   | Use                                                                         
---------- |-------------
[here](https://github.com/jennybc/here_here){target="_blank"}       | to easily load and save data
[readr](https://readr.tidyverse.org/){target="_blank"}      | to import the CSV file data
[dplyr](https://dplyr.tidyverse.org/){target="_blank"}      | to view/arrange/filter/select/compare specific subsets of the data 
[skimr](https://cran.r-project.org/web/packages/skimr/index.html){target="_blank"}      | to get an overview of data
[summarytools](https://cran.r-project.org/web/packages/summarytools/index.html){target="_blank"}      | to get an overview of data in a different style
[magrittr](https://magrittr.tidyverse.org/articles/magrittr.html){target="_blank"}   | to use the `%<>%` piping operator 
[corrplot](https://cran.r-project.org/web/packages/corrplot/vignettes/corrplot-intro.html){target="_blank"} | to make large correlation plots
[ggcorrplot](http://www.sthda.com/english/wiki/ggcorrplot-visualization-of-a-correlation-matrix-using-ggplot2)| also to make large correlation plots
[GGally](https://cran.r-project.org/web/packages/GGally/GGally.pdf){target="_blank"} | to make smaller correlation plots  
[rsample](https://tidymodels.github.io/rsample/articles/Basics.html){target="_blank"}   | to split the data into testing and training sets and to split the training set for cross-validation  
[recipes](https://tidymodels.github.io/recipes/){target="_blank"}   | to pre-process data for modeling in a tidy and reproducible way and to extract pre-processed data (major functions are `recipe()` , `prep()` and various transformation `step_*()` functions, as well as `juice()` - extracts final preprocessed training data and `bake()` - applies recipe steps to testing data). See [here](https://cran.r-project.org/web/packages/recipes/recipes.pdf){target="_blank"}  for more info.
[parsnip](https://tidymodels.github.io/parsnip/){target="_blank"}   | an interface to create models (major functions are  `fit()`, `set_engine()`)
[yardstick](https://tidymodels.github.io/yardstick/){target="_blank"}   | to evaluate the performance of models
[broom](https://www.tidyverse.org/blog/2018/07/broom-0-5-0/){target="_blank"} | to get tidy output for our model fit and performance
[ggplot2](https://ggplot2.tidyverse.org/){target="_blank"}    | to make visualizations with multiple layers
[dials](https://www.tidyverse.org/blog/2019/10/dials-0-0-3/){target="_blank"} | to specify hyper-parameter tuning
[tune](https://tune.tidymodels.org/){target="_blank"} | to perform cross validation, tune hyper-parameters, and get performance metrics
[workflows](https://www.rdocumentation.org/packages/workflows/versions/0.1.1){target="_blank"}| to create modeling workflow to streamline the modeling process
[vip](https://cran.r-project.org/web/packages/vip/vip.pdf){target="_blank"} | to create variable importance plots
[randomForest](https://cran.r-project.org/web/packages/randomForest/randomForest.pdf)| to perform the random forest analysis
[stringr](https://stringr.tidyverse.org/articles/stringr.html){target="_blank"}    | to manipulate the text in the map data
[tidyr](https://tidyr.tidyverse.org/){target="_blank"}      | to separate data within a column into multiple columns
[rnaturalearth](https://cran.r-project.org/web/packages/rnaturalearth/README.html){target="_blank"} | to get the geometry data for the earth to plot the US
[maps](https://cran.r-project.org/web/packages/maps/maps.pdf){target="_blank"} | to get map database data about counties to draw them on our US map
[sf](https://r-spatial.github.io/sf/){target="_blank"} | to convert the map data into a data frame
[lwgeom](https://cran.r-project.org/web/packages/lwgeom/lwgeom.pdf){target="_blank"} | to use the `sf` function to convert the map geographical data
[rgeos](https://cran.r-project.org/web/packages/rgeos/rgeos.pdf){target="_blank"} | to use geometry data
[cowplot](https://cran.r-project.org/web/packages/cowplot/vignettes/introduction.html){target="_blank"} | to allow plots to be combined
___



The first time we use a function, we will use the `::` to indicate which package we are using. Unless we have overlapping function names, this is not necessary, but we will include it here to be informative about where the functions we will use come from.
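
As a small illustration of this notation, the sketch below calls the same base R function with and without the `::` prefix. The values are made up for the example. A well-known case where the prefix does matter is `filter()`, which exists in both `dplyr` and `stats`.

```{r}
# Made-up values, for illustration only
example_values <- c(9.2, 11.5, 10.1)

# Explicit form: package name, then `::`, then the function name
explicit_mean <- base::mean(example_values)

# Implicit form: R finds mean() on the search path
implicit_mean <- mean(example_values)

# Both forms call the same function, so the results match
identical(explicit_mean, implicit_mean)
```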


### Context

The [State of Global Air](https://www.stateofglobalair.org/){target="_blank"} is a report released every year to communicate the impact of air pollution on public health. 

The [State of Global Air 2019 report](https://www.stateofglobalair.org/sites/default/files/soga_2019_report.pdf){target="_blank"}
which uses data from 2017 stated that:

> Air pollution is the **fifth** leading risk factor for mortality worldwide. It is responsible for more
deaths than many better-known risk factors such as malnutrition, alcohol use, and physical inactivity.
Each year, **more** people die from air pollution–related disease than from road **traffic injuries** or **malaria**.

<p align="center">
  <img width="600" src="https://www.healtheffects.org/sites/default/files/SoGA-Figures-01.jpg">
</p>

The report also stated that:

>In 2017, air pollution is estimated to have contributed to close to 5 million
deaths globally — nearly **1 in every 10 deaths**.

```{r, echo = FALSE, out.width="800px"}
knitr::include_graphics(here::here("img","2017deaths.png"))
```
##### [[source]](https://www.stateofglobalair.org/sites/default/files/soga_2019_fact_sheet.pdf){target="_blank"}

The [State of Global Air 2018 report](https://www.stateofglobalair.org/sites/default/files/soga-2018-report.pdf){target="_blank"}, which used data from 2016 and separated different types of air pollution, found that **particulate pollution was particularly associated with mortality**.

```{r, echo = FALSE, out.width="800px"}
knitr::include_graphics(here::here("img","2017mortality.png"))
```

The 2019 report shows that the highest levels of fine particulate pollution occur in Africa and Asia and that:

> More than **90%** of people worldwide live in areas **exceeding** the World Health Organization (WHO) **Guideline** for healthy air. More than half live in areas that do not even meet WHO’s least-stringent air quality target.

```{r, echo = FALSE, out.width="800px"}
knitr::include_graphics(here::here("img","PMworld.png"))
```

Looking at the US specifically, air pollution levels are generally improving. The US Environmental Protection Agency (EPA) also releases a report about air pollution levels called [*Our Nation's Air*](https://gispub.epa.gov/air/trendsreport/2019/#home){target="_blank"}.

```{r, echo = FALSE}
knitr::include_graphics(here::here("img", "US.png"))
```

##### [[source]](https://gispub.epa.gov/air/trendsreport/2019/documentation/AirTrends_Flyer.pdf){target="_blank"}

However, air pollution **continues to contribute to health risk for Americans**, in particular in **regions with higher than national average rates** of pollution that at times exceed the World Health Organization's recommended level. Thus it is necessary to obtain high spatial granularity in estimates of air pollution in order to identify locations where populations are experiencing harmful levels of exposure.


You can see current air quality conditions at this [website](https://aqicn.org/city/usa/){target="_blank"}, and you will notice variation across different cities.

Here were the conditions in Topeka, Kansas when this was written:

```{r, echo = FALSE}
knitr::include_graphics(here::here("img", "Kansas.png"))
```

It reports particulate values using what is called the [Air Quality Index](https://www.airnow.gov/index.cfm?action=aqibasics.aqi){target="_blank"} (AQI) scale. This [calculator](https://airnow.gov/index.cfm?action=airnow.calculator){target="_blank"} indicates that an AQI of 114 is equivalent to 40.7 ug/m^3^ and is considered unhealthy for sensitive individuals. Thus some areas greatly exceed the World Health Organization (WHO) annual exposure guideline (10 ug/m^3^) at certain times, and this may adversely affect the health of people living in these locations.
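
To give a sense of how such a conversion works, here is a minimal sketch of the piecewise linear interpolation behind the AQI. The breakpoint table below is an assumption: it reflects our understanding of the EPA PM~2.5~ breakpoints in use when this case study was written and may since have been revised, so the linked calculator remains the authoritative source. The function name `pm25_to_aqi()` is hypothetical.

```{r}
# Assumed PM2.5 breakpoints (ug/m^3) and the AQI range each one maps to.
# Only the first five categories are included in this sketch.
aqi_breaks <- data.frame(
  conc_lo = c(0.0, 12.1, 35.5, 55.5, 150.5),
  conc_hi = c(12.0, 35.4, 55.4, 150.4, 250.4),
  aqi_lo  = c(0, 51, 101, 151, 201),
  aqi_hi  = c(50, 100, 150, 200, 300)
)

# Hypothetical helper: convert one PM2.5 concentration (given to one
# decimal place) to an AQI value by linear interpolation within its category.
pm25_to_aqi <- function(conc) {
  row <- which(conc >= aqi_breaks$conc_lo & conc <= aqi_breaks$conc_hi)
  if (length(row) != 1) return(NA_real_)  # outside the table or between rows
  b <- aqi_breaks[row, ]
  slope <- (b$aqi_hi - b$aqi_lo) / (b$conc_hi - b$conc_lo)
  round(slope * (conc - b$conc_lo) + b$aqi_lo)
}

pm25_to_aqi(40.7)  # 114 under these assumed breakpoints
```

Under these assumed breakpoints, the sketch reproduces the 114 AQI reported by the calculator for 40.7 ug/m^3^.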

Furthermore, adverse health effects have been associated with populations experiencing higher pollution exposure even when the levels are below suggested guidelines. In addition, it appears that the composition of the particulate matter and the influence of other demographic factors may make specific populations more at risk for adverse health effects due to air pollution. See this [article](https://www.nejm.org/doi/full/10.1056/NEJMoa1702747){target="_blank"} for more details.

The monitor data that we will use in this case study comes from a system of monitors in which roughly 90% are located within cities. Thus there is an **equity issue** in terms of capturing the air pollution levels of more rural areas. Therefore, to get a better sense of the pollution exposures for the individuals living in these areas, methods like machine learning can be very useful to estimate air pollution levels in **areas with little to no monitoring**.

Indeed, machine learning methods are used to estimate air pollution in these low-monitoring areas so that we can make a map like this, with annual estimates for all of the contiguous US:

<p align="center">
  <img width="600" src="https://arc-anglerfish-washpost-prod-washpost.s3.amazonaws.com/public/SAWOEGBXMVGQ7AS5PZ6UUOX6FY.png">
</p>


##### [[source]](https://www.google.com/url?sa=i&url=https%3A%2F%2Fwww.washingtonpost.com%2Fbusiness%2F2019%2F10%2F23%2Fair-pollution-is-getting-worse-data-show-more-people-are-dying%2F&psig=AOvVaw3v-ZDTBPnLP2MYtKf3Undj&ust=1585784479068000&source=images&cd=vfe&ved=0CAIQjRxqFwoTCPCyn9fxxegCFQAAAAAdAAAAABAd){target="_blank"}

This is what we aim to achieve in this case study.

### Limitations

There are some important considerations regarding this data analysis to keep in mind: 

1) The data in our analysis does not include information about the composition of particulate matter. Different types of particulates may be more benign or deleterious for health outcomes.

2) Outdoor pollution levels are not necessarily an indication of individual exposures. People spend differing amounts of time indoors and outdoors and are exposed to different pollution levels indoors. People are now developing personal monitoring systems to track air pollution levels on the personal level.

3) Our analysis will use annual mean estimates; however, pollution levels can vary greatly by season, day, and even hour. There are data sources with finer temporal resolution, but we are interested in long-term exposures, as these appear to be the most influential for health outcomes, so we chose to use annual-level data. 


## What are the data?

In Machine Learning for prediction, there are two main types of variables:

1) Outcome variable
2) Predictor variables

The **outcome variable** is what we are trying to **predict**. In building our model we actually have the outcome variable data, but we want to see how well our predictor variables can explain the variation in our outcome data. This gives us a sense of how well we can use the predictor variable data to predict our outcome variable levels when we in fact do not have data about the outcome.

As a simpler example, imagine that we have data about the sales and characteristics of cars from last year and we want to predict which cars might sell well this year. We do not have the sales data yet for this year, but we do know the characteristics of our cars for this year. We can use a model of the characteristics that explained sales last year to estimate what cars might sell well this year. In this case, our outcome variable is the sale performance of the cars, while the different characteristics of the cars make up our predictor or explanatory variables.

In this case study, we will evaluate air pollution monitor data of fine particulate matter (PM~2.5~) in the contiguous US from 2008, as well as data about population density, road density, urbanization levels, and NASA satellite data to develop models to predict localized air pollution levels. 

The monitor data will be our **outcome variable**.  We want to determine if we can **predict** air pollution levels based on other types of data, like road density and population density to see if we can use these data to predict air pollution in areas where there are no monitors. 
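
The idea of outcome and predictor variables can be sketched with a toy example. The data below are entirely made up, the column names are hypothetical, and a simple linear regression is used only as a stand-in for illustration. It is not the model or the data used later in this case study.

```{r}
# Made-up toy data: each row is an imaginary monitor
toy_monitors <- data.frame(
  pm25         = c(8.1, 9.4, 10.2, 11.8, 12.5, 13.9),  # outcome variable
  road_density = c(1.0, 1.5, 2.1, 2.8, 3.2, 4.0),      # predictor variable
  pop_density  = c(120, 300, 150, 900, 400, 1500)      # predictor variable
)

# Fit a model that explains the outcome using the predictors
toy_fit <- stats::lm(pm25 ~ road_density + pop_density, data = toy_monitors)

# An imaginary location with predictor data but no monitor
unmonitored <- data.frame(road_density = 2.5, pop_density = 600)

# Use the fitted model to estimate the outcome at that location
toy_prediction <- stats::predict(toy_fit, newdata = unmonitored)
toy_prediction
```

The last step mirrors our goal: estimating air pollution where predictor data exist but no monitor does.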


### Our outcome variable

The monitor data that we will be using comes from **[gravimetric monitors](https://publiclab.org/wiki/filter-pm){target="_blank"}** operated by the US [Environmental Protection Agency (EPA)](https://www.epa.gov/){target="_blank"}. These monitors use a filtration system to specifically capture fine particulate matter. The weight of this matter is manually measured daily or weekly. See [here](https://www3.epa.gov/ttnamti1/files/ambient/pm25/spec/RTIGravMassSOPFINAL.pdf){target="_blank"} for the EPA standard operating procedure for PM gravimetric analysis in 2008.



```{r, echo = FALSE, out.width="150px"}
knitr::include_graphics(here::here("img","filter.png"))
```

##### [[source](https://publiclab.org/wiki/filter-pm)]{target="_blank"}

Here is an image of what the gravimetric monitors look like:

```{r, echo = FALSE, out.width="100px"}
knitr::include_graphics(here::here("img","monitor.png"))
```


Gravimetric analysis is also used for [emission testing](https://www.mt.com/us/en/home/applications/Laboratory_weighing/emissions-testing-particulate-matter.html){target="_blank"}. The same idea applies: a fresh filter is applied and the desired amount of time passes, then the filter is removed and weighed. 

There are [other monitoring systems](https://www.sensirion.com/en/about-us/newsroom/sensirion-specialist-articles/particulate-matter-sensing-for-air-quality-measurements/){target="_blank"} that can provide hourly measurements, but we will not be using data from these monitors in our analysis. Gravimetric analysis is considered to be among the most accurate methods.

In our CSV, the **value** column indicates the PM~2.5~ monitor average for 2008 in mass of fine particles per volume of air for 876 gravimetric monitors. The units are micrograms of fine particulate matter (PM) that is less than 2.5 micrometers in diameter per cubic meter of air - mass concentration (ug/m^3^). Recall that the WHO exposure guideline is < 10 ug/m^3^ on average annually for PM~2.5~.

### Our predictor variables

There are 48 predictor variables with values for each of the 876 monitors included in our outcome variable. The data comes from the US [Environmental Protection Agency (EPA)](https://www.epa.gov/){target="_blank"}, the [National Aeronautics and Space Administration (NASA)](https://www.nasa.gov/){target="_blank"}, the US [Census](https://www.census.gov/about/what/census-at-a-glance.html){target="_blank"}, and the [National Center for Health Statistics (NCHS)](https://www.cdc.gov/nchs/about/index.htm){target="_blank"}.

<details><summary> Click here to see a table about the variables </summary>


Variable   | Details                                                                        
---------- |-------------
**id**  | Monitor number  <br> -- the county number is indicated before the decimal <br> -- the monitor number is indicated after the decimal <br>  **Example**: 1073.0023 is Jefferson county (1073) and .0023 is one of 8 monitors 
**fips** | Federal information processing standard number for the county where the monitor is located <br> -- 5 digit id code for counties (zero is often the first value and sometimes is not shown) <br> -- the first 2 numbers indicate the state <br> -- the last three numbers indicate the county <br>  **Example**: Alabama's state code is 01 because it is first alphabetically <br> (note: Alaska and Hawaii are not included because they are not part of the contiguous US)  
**Lat** | Latitude of the monitor in degrees  
**Lon** | Longitude of the monitor in degrees  
**state** | State where the monitor is located
**county** | County where the monitor is located
**city** | City where the monitor is located
**CMAQ**  | Estimated values of air pollution from a computational model called [**Community Multiscale Air Quality (CMAQ)**](https://www.epa.gov/cmaq){target="_blank"} <br> --  A modeling system that simulates the physics of the atmosphere using chemistry and weather data to predict the air pollution <br> -- ***Does not use any of the PM~2.5~ gravimetric monitoring data.*** (There is a version that does use the gravimetric monitoring data, but not this one!) <br> -- Data from the EPA
**zcta** | [Zip Code Tabulation Area](https://www2.census.gov/geo/pdfs/education/brochures/ZCTAs.pdf){target="_blank"} where the monitor is located <br> -- Postal Zip codes are converted into "generalized areal representations" that are non-overlapping  <br> -- Data from the 2010 Census  
**zcta_area** | Land area of the zip code area in meters squared  <br> -- Data from the 2010 Census  
**zcta_pop** | Population in the zip code area  <br> -- Data from the 2010 Census  
**imp_a500** | Impervious surface measure <br> -- Within a circle with a radius of 500 meters around the monitor <br> -- Impervious surface are roads, concrete, parking lots, buildings <br> -- This is a measure of development 
**imp_a1000** | Impervious surface measure <br> --  Within a circle with a radius of 1000 meters around the monitor
**imp_a5000** | Impervious surface measure <br> --  Within a circle with a radius of 5000 meters around the monitor  
**imp_a10000** | Impervious surface measure <br> --  Within a circle with a radius of 10000 meters around the monitor   
**imp_a15000** | Impervious surface measure <br> --  Within a circle with a radius of 15000 meters around the monitor  
**county_area** | Land area of the county of the monitor in meters squared  
**county_pop** | Population of the county of the monitor  
**Log_dist_to_prisec** | Log (Natural log) distance to a primary or secondary road from the monitor <br> -- Highway or major road  
**log_pri_length_5000** | Count of primary road length in meters in a circle with a radius of 5000 meters around the monitor (Natural log) <br> -- Highways only  
**log_pri_length_10000** | Count of primary road length in meters in a circle with a radius of 10000 meters around the monitor (Natural log) <br> -- Highways only  
**log_pri_length_15000** | Count of primary road length in meters in a circle with a radius of 15000 meters around the monitor (Natural log) <br> -- Highways only  
**log_pri_length_25000** | Count of primary road length in meters in a circle with a radius of 25000 meters around the monitor (Natural log) <br> -- Highways only  
**log_prisec_length_500** | Count of primary and secondary road length in meters in a circle with a radius of 500 meters around the monitor (Natural log)  <br> -- Highway and secondary roads  
**log_prisec_length_1000** | Count of primary and secondary road length in meters in a circle with a radius of 1000 meters around the monitor (Natural log)  <br> -- Highway and secondary roads  
**log_prisec_length_5000** | Count of primary and secondary road length in meters in a circle with a radius of 5000 meters around the monitor (Natural log)  <br> -- Highway and secondary roads  
**log_prisec_length_10000** | Count of primary and secondary road length in meters in a circle with a radius of 10000 meters around the monitor (Natural log)  <br> -- Highway and secondary roads  
**log_prisec_length_15000** | Count of primary and secondary road length in meters in a circle with a radius of 15000 meters around the monitor (Natural log)  <br> -- Highway and secondary roads  
**log_prisec_length_25000** | Count of primary and secondary road length in meters in a circle with a radius of 25000 meters around the monitor (Natural log)  <br> -- Highway and secondary roads      
**log_nei_2008_pm25_sum_10000** | Tons of emissions from the major sources database (annual data), summed over all sources within a circle with a radius of 10000 meters around the monitor (Natural log)
**log_nei_2008_pm25_sum_15000** | Tons of emissions from the major sources database (annual data), summed over all sources within a circle with a radius of 15000 meters around the monitor (Natural log)
**log_nei_2008_pm25_sum_25000** | Tons of emissions from the major sources database (annual data), summed over all sources within a circle with a radius of 25000 meters around the monitor (Natural log)
**log_nei_2008_pm10_sum_10000** | Tons of emissions from the major sources database (annual data), summed over all sources within a circle with a radius of 10000 meters around the monitor (Natural log)
**log_nei_2008_pm10_sum_15000** | Tons of emissions from the major sources database (annual data), summed over all sources within a circle with a radius of 15000 meters around the monitor (Natural log)
**log_nei_2008_pm10_sum_25000** | Tons of emissions from the major sources database (annual data), summed over all sources within a circle with a radius of 25000 meters around the monitor (Natural log)
**popdens_county** | Population density (number of people per kilometer squared area of the county)
**popdens_zcta** | Population density (number of people per kilometer squared area of zcta)
**nohs** | Percentage of people in the zcta area where the monitor is located who **do not have a high school degree** <br> -- Data from the Census
**somehs** | Percentage of people in the zcta area where the monitor is located whose highest formal educational attainment was **some high school education** <br> -- Data from the Census
**hs** | Percentage of people in the zcta area where the monitor is located whose highest formal educational attainment was completing a **high school degree** <br> -- Data from the Census
**somecollege** | Percentage of people in the zcta area where the monitor is located whose highest formal educational attainment was completing **some college education** <br> -- Data from the Census
**associate** | Percentage of people in the zcta area where the monitor is located whose highest formal educational attainment was completing an **associate degree** <br> -- Data from the Census
**bachelor** | Percentage of people in the zcta area where the monitor is located whose highest formal educational attainment was a **bachelor's degree** <br> -- Data from the Census
**grad** | Percentage of people in the zcta area where the monitor is located whose highest formal educational attainment was a **graduate degree** <br> -- Data from the Census
**pov** | Percentage of people in the zcta area where the monitor is located who lived in [**poverty**](https://aspe.hhs.gov/2008-hhs-poverty-guidelines) in 2008 (it is unclear whether the [2007 guidelines](https://aspe.hhs.gov/2007-hhs-poverty-guidelines) applied instead) <br> -- Data from the Census
**hs_orless** | Percentage of people in the zcta area where the monitor is located whose highest formal educational attainment was a **high school degree or less** (sum of nohs, somehs, and hs)
**urc2013** | [2013 Urban-rural classification](https://www.cdc.gov/nchs/data/series/sr_02/sr02_166.pdf){target="_blank"} of the county where the monitor is located <br> -- 6 category variable - 1 is totally urban, 6 is completely rural <br> -- Data from the [National Center for Health Statistics](https://www.cdc.gov/nchs/index.htm){target="_blank"}
**urc2006** | [2006 Urban-rural classification](https://www.cdc.gov/nchs/data/series/sr_02/sr02_154.pdf){target="_blank"} of the county where the monitor is located <br> -- 6 category variable - 1 is totally urban, 6 is completely rural <br> -- Data from the [National Center for Health Statistics](https://www.cdc.gov/nchs/index.htm){target="_blank"}
**aod** | Aerosol Optical Depth measurement from a NASA satellite <br> -- based on the diffraction of a laser <br> -- used as a proxy of particulate pollution <br> -- unit-less - higher value indicates more pollution <br> -- Data from NASA  

</details>


Many of these predictor variables have to do with the circular area around the monitor called the "buffer". These are illustrated in the following figure:

```{r, echo = FALSE, out.width = "800px"}
knitr::include_graphics(here::here("img", "regression.png"))
```

##### [[source](https://www.ncbi.nlm.nih.gov/pubmed/15292906)]{target="_blank"}
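
To make the idea of a buffer more concrete, here is a minimal sketch of how a buffer-based predictor could be computed. The monitor location, the emission sources, and the helper functions `haversine_m()` and `buffer_sum()` are all hypothetical and made up for illustration. This is not necessarily how the variables in our data were derived.

```{r}
# Hypothetical example: these coordinates and emission values are NOT from our data

# Great-circle distance in meters between two points (haversine formula)
haversine_m <- function(lat1, lon1, lat2, lon2, radius = 6371000) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius * asin(pmin(1, sqrt(a)))
}

# A made-up monitor and three made-up emission sources
monitor_lat <- 39.30
monitor_lon <- -76.60

sources <- data.frame(
  lat  = c(39.31, 39.35, 39.60),
  lon  = c(-76.61, -76.55, -76.60),
  tons = c(12, 30, 100)
)

# Distance from the monitor to each source
sources$dist_m <- haversine_m(monitor_lat, monitor_lon, sources$lat, sources$lon)

# Sum the emissions of all sources that fall inside the buffer
buffer_sum <- function(radius_m) {
  sum(sources$tons[sources$dist_m <= radius_m])
}

buffer_sum(10000)        # total tons within a 10000 meter buffer
log(buffer_sum(10000))   # natural log, as in the log_nei_* variables
```

In this sketch, increasing the radius can only add sources to the sum, which is why the same quantity is often summarized at several buffer sizes.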



## Data Import

We have one CSV file that contains both our single **outcome variable** and all of our **predictor variables**.

Let's import our data into R now so that we can explore the data further. We will call our data object `pm` for particulate matter.

```{r}
pm <- readr::read_csv(here("docs", "pm25_data.csv"))
```

## Data Exploration and Wrangling

The first step in performing a machine learning analysis is to explore the data to better understand the variables it contains. In doing so, we may learn important details that we should keep in mind as we try to predict our outcome variable.

First let's just get a general sense of our data. We can do that using the `glimpse()` function of the `dplyr` package (it is also in the `tibble` package).

We will also use the `%>%` pipe, which passes the object on its left as the input to the function on its right. This becomes especially useful when we have multiple sequential steps using the same data object. To use the pipe notation we need to install and load `dplyr` as well.

For example, here we first grab the `pm` data object and then pipe it into the `glimpse()` function.

#### {.scrollable }

```{r}
# Scroll through the output!
pm %>%
  dplyr::glimpse()
```

####
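
As a small illustration of what the pipe does, the sketch below uses a made-up tibble called `toy` (not part of our data) to show that piping an object into a function is equivalent to passing it as the first argument.

```{r}
library(dplyr)

# `toy` is a made-up tibble used only for illustration
toy <- tibble::tibble(id = c(1, 2, 3), value = c(9.5, 11.2, 10.1))

# These two calls do the same thing
piped  <- toy %>% dplyr::filter(value > 10)
nested <- dplyr::filter(toy, value > 10)
identical(piped, nested)

# The benefit of the pipe shows with several sequential steps
toy %>%
  dplyr::filter(value > 10) %>%
  dplyr::summarize(mean_value = mean(value))
```

Reading the last statement from top to bottom mirrors the order in which the steps are carried out, which tends to be easier to follow than nested function calls.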

We can see that there are 876 monitors and that we have 50 total variables - one of which is the outcome. In this case our outcome variable is called `value`. 

Notice that some of the variables that we would think of as factors (categorical) are currently of class double, as indicated by the `<dbl>` just to the right of the column names/variable names in the `glimpse()` output. Examples include the monitor ID (`id`), the Federal Information Processing Standard number for the county where the monitor was located (`fips`), and the zip code tabulation area (`zcta`).

Let's convert these variables into factors. We can do this using the `mutate_at()` function of the `dplyr` package and the `as.factor()` base function. 

In this case we are also using the assignment pipe (or double pipe) of the `magrittr` package, which looks like this: `%<>%`. This allows us to use the `pm` data as input and also reassign the output to the same data object name.


#### {.scrollable }

```{r}
# Scroll through the output!
pm %<>%
  dplyr::mutate_at(vars(id, fips, zcta), as.factor) 

glimpse(pm)
```

####

Great! We can see that these variables are now factors, as indicated by `<fct>` after the variable name.
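
Note that in `dplyr` version 1.0.0 and later, `mutate_at()` is superseded by `across()`, although it still works. The sketch below, which uses a made-up tibble called `toy_ids` rather than our data, shows both styles side by side. It also shows that the assignment pipe is shorthand for piping and then reassigning.

```{r}
library(dplyr)
library(magrittr)

# `toy_ids` is a made-up tibble used only for illustration
toy_ids <- tibble::tibble(
  id    = c(1003.001, 1027.0001),
  fips  = c(1003, 1027),
  value = c(9.6, 10.8)
)

# Style 1: assignment pipe with mutate_at(), as used in this case study
with_assign_pipe <- toy_ids
with_assign_pipe %<>%
  dplyr::mutate_at(dplyr::vars(id, fips), as.factor)

# Style 2: explicit reassignment with across()
with_across <- toy_ids %>%
  dplyr::mutate(dplyr::across(c(id, fips), as.factor))

# Both approaches convert id and fips to factors and leave value numeric
sapply(with_assign_pipe, class)
sapply(with_across, class)
```

Either style is fine here. The `across()` form may be preferable in new code because it is the approach that `dplyr` currently recommends.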

### Packages to get a sense of the data

The `skim()` function of the `skimr` package is also really helpful for getting a general sense of your data.


#### {.scrollable }

```{r}
# Scroll through the output!
skim(pm)
```

####
Notice that there is a column called `n_missing` giving the number of values that are missing for each variable. It looks like our data is complete: we do not have any missing data. This is also indicated by the `complete_rate` column, which shows the proportion of values that are not missing. In our case all variables have a value of 1, indicating that they are fully complete.
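
These two summaries can also be computed by hand, which can be a useful check. Here is a minimal sketch using a small made-up data frame called `toy_na` that, unlike our data, contains some missing values:

```{r}
# `toy_na` is a made-up data frame used only for illustration
toy_na <- data.frame(
  value = c(9.5, NA, 10.1, 12.3),
  state = c("Ohio", "Texas", NA, NA)
)

# Number of missing values in each column
n_missing <- colSums(is.na(toy_na))

# Proportion of values in each column that are not missing
complete_rate <- 1 - n_missing / nrow(toy_na)

n_missing
complete_rate
```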

The `n_unique` column shows us the number of unique values for each of our columns. We can see that there are 49 states represented in the data, and we know that the data should cover only the contiguous states. Let's take a look to see which states are included:

#### {.scrollable }
```{r}
# Scroll through the output!
pm %>% 
  distinct(state) %>%
  print(n = 1e3)
```
####

Looks like "District of Columbia" is being included as a state. We can see that indeed Alaska and Hawaii are not included in the data.
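
One way to confirm this programmatically is to compare the states in the data against `state.name`, a built-in character vector in base R that lists the 50 US states. In the sketch below, `toy_states` is a made-up stand-in for the `state` column of our data:

```{r}
# Made-up stand-in: the 48 contiguous states plus the District of Columbia
toy_states <- c(
  setdiff(state.name, c("Alaska", "Hawaii")),
  "District of Columbia"
)

# Number of distinct values
length(unique(toy_states))

# States that are absent from the data
setdiff(state.name, toy_states)

# Entries in the data that are not one of the 50 states
setdiff(toy_states, state.name)
```

To run the same check on the real data, `toy_states` could be replaced with `pm$state`.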

Here is another method of looking at the data, using the `dfSummary()` function of the `summarytools` package. We need to copy and paste its output into the R Markdown document.

```{r, eval = FALSE}
dfSummary(pm, plain.ascii = FALSE, style = "grid", 
          graph.magnif = 0.45,  tmp.img.dir = "tmp")
```

<details><summary> Click here to see the dfSummary table </summary>


**Dimensions:** 876 x 50  
**Duplicates:** 0  

+----+------------------------------+------------------------------------------+---------------------+---------------------+--------+---------+
| No | Variable                     | Stats / Values                           | Freqs (% of Valid)  | Graph               | Valid  | Missing |
+====+==============================+==========================================+=====================+=====================+========+=========+
| 1  | id\                          | 1\. 1003.001\                            | 1 ( 0.1%)\          | ![](tmp/ds0101.png) | 876\   | 0\      |
|    | [factor]                     | 2\. 1027.0001\                           | 1 ( 0.1%)\          |                     | (100%) | (0%)    |
|    |                              | 3\. 1033.1002\                           | 1 ( 0.1%)\          |                     |        |         |
|    |                              | 4\. 1049.1003\                           | 1 ( 0.1%)\          |                     |        |         |
|    |                              | 5\. 1055.001\                            | 1 ( 0.1%)\          |                     |        |         |
|    |                              | 6\. 1069.0003\                           | 1 ( 0.1%)\          |                     |        |         |
|    |                              | 7\. 1073.0023\                           | 1 ( 0.1%)\          |                     |        |         |
|    |                              | 8\. 1073.1005\                           | 1 ( 0.1%)\          |                     |        |         |
|    |                              | 9\. 1073.1009\                           | 1 ( 0.1%)\          |                     |        |         |
|    |                              | 10\. 1073.101\                           | 1 ( 0.1%)\          |                     |        |         |
|    |                              | [ 866 others ]                           | 866 (98.9%)         |                     |        |         |
+----+------------------------------+------------------------------------------+---------------------+---------------------+--------+---------+
| 2  | value\                       | Mean (sd) : 10.8 (2.6)\                  | 875 distinct values | ![](tmp/ds0102.png) | 876\   | 0\      |
|    | [numeric]                    | min < med < max:\                        |                     |                     | (100%) | (0%)    |
|    |                              | 3 < 11.2 < 23.2\                         |                     |                     |        |         |
|    |                              | IQR (CV) : 3.1 (0.2)                     |                     |                     |        |         |
+----+------------------------------+------------------------------------------+---------------------+---------------------+--------+---------+
| 3  | fips\                        | 1\. 1003\                                | 1 ( 0.1%)\          | ![](tmp/ds0103.png) | 876\   | 0\      |
|    | [factor]                     | 2\. 1027\                                | 1 ( 0.1%)\          |                     | (100%) | (0%)    |
|    |                              | 3\. 1033\                                | 1 ( 0.1%)\          |                     |        |         |
|    |                              | 4\. 1049\                                | 1 ( 0.1%)\          |                     |        |         |
|    |                              | 5\. 1055\                                | 1 ( 0.1%)\          |                     |        |         |
|    |                              | 6\. 1069\                                | 1 ( 0.1%)\          |                     |        |         |
|    |                              | 7\. 1073\                                | 8 ( 0.9%)\          |                     |        |         |
|    |                              | 8\. 1089\                                | 1 ( 0.1%)\          |                     |        |         |
|    |                              | 9\. 1097\                                | 2 ( 0.2%)\          |                     |        |         |
|    |                              | 10\. 1101\                               | 1 ( 0.1%)\          |                     |        |         |
|    |                              | [ 559 others ]                           | 858 (98.0%)         |                     |        |         |
+----+------------------------------+------------------------------------------+---------------------+---------------------+--------+---------+
| 4  | lat\                         | Mean (sd) : 38.5 (4.6)\                  | 876 distinct values | ![](tmp/ds0104.png) | 876\   | 0\      |
|    | [numeric]                    | min < med < max:\                        |                     |                     | (100%) | (0%)    |
|    |                              | 25.5 < 39.3 < 48.4\                      |                     |                     |        |         |
|    |                              | IQR (CV) : 6.6 (0.1)                     |                     |                     |        |         |
+----+------------------------------+------------------------------------------+---------------------+---------------------+--------+---------+
| 5  | lon\                         | Mean (sd) : -91.7 (15)\                  | 876 distinct values | ![](tmp/ds0105.png) | 876\   | 0\      |
|    | [numeric]                    | min < med < max:\                        |                     |                     | (100%) | (0%)    |
|    |                              | -124.2 < -87.5 < -68\                    |                     |                     |        |         |
|    |                              | IQR (CV) : 18.5 (-0.2)                   |                     |                     |        |         |
+----+------------------------------+------------------------------------------+---------------------+---------------------+--------+---------+
| 6  | state\                       | 1\. California\                          | 85 ( 9.7%)\         | ![](tmp/ds0106.png) | 876\   | 0\      |
|    | [character]                  | 2\. Ohio\                                | 44 ( 5.0%)\         |                     | (100%) | (0%)    |
|    |                              | 3\. Illinois\                            | 38 ( 4.3%)\         |                     |        |         |
|    |                              | 4\. Indiana\                             | 36 ( 4.1%)\         |                     |        |         |
|    |                              | 5\. North Carolina\                      | 35 ( 4.0%)\         |                     |        |         |
|    |                              | 6\. Pennsylvania\                        | 32 ( 3.7%)\         |                     |        |         |
|    |                              | 7\. Michigan\                            | 30 ( 3.4%)\         |                     |        |         |
|    |                              | 8\. Florida\                             | 29 ( 3.3%)\         |                     |        |         |
|    |                              | 9\. Georgia\                             | 28 ( 3.2%)\         |                     |        |         |
|    |                              | 10\. Texas\                              | 27 ( 3.1%)\         |                     |        |         |
|    |                              | [ 39 others ]                            | 492 (56.2%)         |                     |        |         |
+----+------------------------------+------------------------------------------+---------------------+---------------------+--------+---------+
| 7  | county\                      | 1\. Jefferson\                           | 18 ( 2.1%)\         | ![](tmp/ds0107.png) | 876\   | 0\      |
|    | [character]                  | 2\. Cook\                                | 12 ( 1.4%)\         |                     | (100%) | (0%)    |
|    |                              | 3\. Hamilton\                            | 11 ( 1.3%)\         |                     |        |         |
|    |                              | 4\. Lake\                                | 11 ( 1.3%)\         |                     |        |         |
|    |                              | 5\. Los Angeles\                         | 10 ( 1.1%)\         |                     |        |         |
|    |                              | 6\. Wayne\                               | 10 ( 1.1%)\         |                     |        |         |
|    |                              | 7\. Washington\                          | 9 ( 1.0%)\          |                     |        |         |
|    |                              | 8\. Cuyahoga\                            | 7 ( 0.8%)\          |                     |        |         |
|    |                              | 9\. Jackson\                             | 7 ( 0.8%)\          |                     |        |         |
|    |                              | 10\. Madison\                            | 7 ( 0.8%)\          |                     |        |         |
|    |                              | [ 461 others ]                           | 774 (88.4%)         |                     |        |         |
+----+------------------------------+------------------------------------------+---------------------+---------------------+--------+---------+
| 8  | city\                        | 1\. Not in a city\                       | 103 (11.8%)\        | ![](tmp/ds0108.png) | 876\   | 0\      |
|    | [character]                  | 2\. New York\                            | 9 ( 1.0%)\          |                     | (100%) | (0%)    |
|    |                              | 3\. Cleveland\                           | 6 ( 0.7%)\          |                     |        |         |
|    |                              | 4\. Baltimore\                           | 5 ( 0.6%)\          |                     |        |         |
|    |                              | 5\. Chicago\                             | 5 ( 0.6%)\          |                     |        |         |
|    |                              | 6\. Detroit\                             | 5 ( 0.6%)\          |                     |        |         |
|    |                              | 7\. Milwaukee\                           | 5 ( 0.6%)\          |                     |        |         |
|    |                              | 8\. New Haven\                           | 5 ( 0.6%)\          |                     |        |         |
|    |                              | 9\. Philadelphia\                        | 5 ( 0.6%)\          |                     |        |         |
|    |                              | 10\. Springfield\                        | 5 ( 0.6%)\          |                     |        |         |
|    |                              | [ 597 others ]                           | 723 (82.5%)         |                     |        |         |
+----+------------------------------+------------------------------------------+---------------------+---------------------+--------+---------+
| 9  | CMAQ\                        | Mean (sd) : 8.4 (3)\                     | 601 distinct values | ![](tmp/ds0109.png) | 876\   | 0\      |
|    | [numeric]                    | min < med < max:\                        |                     |                     | (100%) | (0%)    |
|    |                              | 1.6 < 8.6 < 23.1\                        |                     |                     |        |         |
|    |                              | IQR (CV) : 3.7 (0.4)                     |                     |                     |        |         |
+----+------------------------------+------------------------------------------+---------------------+---------------------+--------+---------+
| 10 | zcta\                        | 1\. 1022\                                | 1 ( 0.1%)\          | ![](tmp/ds0110.png) | 876\   | 0\      |
|    | [factor]                     | 2\. 1103\                                | 2 ( 0.2%)\          |                     | (100%) | (0%)    |
|    |                              | 3\. 1201\                                | 1 ( 0.1%)\          |                     |        |         |
|    |                              | 4\. 1608\                                | 2 ( 0.2%)\          |                     |        |         |
|    |                              | 5\. 1832\                                | 1 ( 0.1%)\          |                     |        |         |
|    |                              | 6\. 1840\                                | 1 ( 0.1%)\          |                     |        |         |
|    |                              | 7\. 1863\                                | 1 ( 0.1%)\          |                     |        |         |
|    |                              | 8\. 1904\                                | 1 ( 0.1%)\          |                     |        |         |
|    |                              | 9\. 2113\                                | 1 ( 0.1%)\          |                     |        |         |
|    |                              | 10\. 2119\                               | 1 ( 0.1%)\          |                     |        |         |
|    |                              | [ 832 others ]                           | 864 (98.6%)         |                     |        |         |
+----+------------------------------+------------------------------------------+---------------------+---------------------+--------+---------+
| 11 | zcta_area\                   | Mean (sd) : 183173481.9 (542598878.5)\   | 842 distinct values | ![](tmp/ds0111.png) | 876\   | 0\      |
|    | [numeric]                    | min < med < max:\                        |                     |                     | (100%) | (0%)    |
|    |                              | 15459 < 37653560.5 < 8164820625\         |                     |                     |        |         |
|    |                              | IQR (CV) : 145836906.5 (3)               |                     |                     |        |         |
+----+------------------------------+------------------------------------------+---------------------+---------------------+--------+---------+
| 12 | zcta_pop\                    | Mean (sd) : 24227.6 (17772.2)\           | 837 distinct values | ![](tmp/ds0112.png) | 876\   | 0\      |
|    | [numeric]                    | min < med < max:\                        |                     |                     | (100%) | (0%)    |
|    |                              | 0 < 22014 < 95397\                       |                     |                     |        |         |
|    |                              | IQR (CV) : 25207.8 (0.7)                 |                     |                     |        |         |
+----+------------------------------+------------------------------------------+---------------------+---------------------+--------+---------+
| 13 | imp_a500\                    | Mean (sd) : 24.7 (19.3)\                 | 816 distinct values | ![](tmp/ds0113.png) | 876\   | 0\      |
|    | [numeric]                    | min < med < max:\                        |                     |                     | (100%) | (0%)    |
|    |                              | 0 < 25.1 < 69.6\                         |                     |                     |        |         |
|    |                              | IQR (CV) : 36.5 (0.8)                    |                     |                     |        |         |
+----+------------------------------+------------------------------------------+---------------------+---------------------+--------+---------+
| 14 | imp_a1000\                   | Mean (sd) : 24.3 (18)\                   | 860 distinct values | ![](tmp/ds0114.png) | 876\   | 0\      |
|    | [numeric]                    | min < med < max:\                        |                     |                     | (100%) | (0%)    |
|    |                              | 0 < 24.5 < 67.5\                         |                     |                     |        |         |
|    |                              | IQR (CV) : 33.3 (0.7)                    |                     |                     |        |         |
+----+------------------------------+------------------------------------------+---------------------+---------------------+--------+---------+
| 15 | imp_a5000\                   | Mean (sd) : 19.9 (14.7)\                 | 870 distinct values | ![](tmp/ds0115.png) | 876\   | 0\      |
|    | [numeric]                    | min < med < max:\                        |                     |                     | (100%) | (0%)    |
|    |                              | 0.1 < 19.1 < 74.6\                       |                     |                     |        |         |
|    |                              | IQR (CV) : 23.3 (0.7)                    |                     |                     |        |         |
+----+------------------------------+------------------------------------------+---------------------+---------------------+--------+---------+
| 16 | imp_a10000\                  | Mean (sd) : 15.8 (13.8)\                 | 870 distinct values | ![](tmp/ds0116.png) | 876\   | 0\      |
|    | [numeric]                    | min < med < max:\                        |                     |                     | (100%) | (0%)    |
|    |                              | 0.1 < 12.4 < 72.1\                       |                     |                     |        |         |
|    |                              | IQR (CV) : 19.6 (0.9)                    |                     |                     |        |         |
+----+------------------------------+------------------------------------------+---------------------+---------------------+--------+---------+
| 17 | imp_a15000\                  | Mean (sd) : 13.4 (13.1)\                 | 870 distinct values | ![](tmp/ds0117.png) | 876\   | 0\      |
|    | [numeric]                    | min < med < max:\                        |                     |                     | (100%) | (0%)    |
|    |                              | 0.1 < 9.7 < 71.1\                        |                     |                     |        |         |
|    |                              | IQR (CV) : 17.3 (1)                      |                     |                     |        |         |
+----+------------------------------+------------------------------------------+---------------------+---------------------+--------+---------+
| 18 | county_area\                 | Mean (sd) : 3768701992.1 (6212829553.6)\ | 564 distinct values | ![](tmp/ds0118.png) | 876\   | 0\      |
|    | [numeric]                    | min < med < max:\                        |                     |                     | (100%) | (0%)    |
|    |                              | 33703512 < 1690826566.5 < 51947229509\   |                     |                     |        |         |
|    |                              | IQR (CV) : 1761655911.5 (1.6)            |                     |                     |        |         |
+----+------------------------------+------------------------------------------+---------------------+---------------------+--------+---------+
| 19 | county_pop\                  | Mean (sd) : 687298.4 (1293488.7)\        | 564 distinct values | ![](tmp/ds0119.png) | 876\   | 0\      |
|    | [numeric]                    | min < med < max:\                        |                     |                     | (100%) | (0%)    |
|    |                              | 783 < 280730.5 < 9818605\                |                     |                     |        |         |
|    |                              | IQR (CV) : 642211 (1.9)                  |                     |                     |        |         |
+----+------------------------------+------------------------------------------+---------------------+---------------------+--------+---------+
| 20 | log_dist_to_prisec\          | Mean (sd) : 6.2 (1.4)\                   | 870 distinct values | ![](tmp/ds0120.png) | 876\   | 0\      |
|    | [numeric]                    | min < med < max:\                        |                     |                     | (100%) | (0%)    |
|    |                              | -1.5 < 6.4 < 10.5\                       |                     |                     |        |         |
|    |                              | IQR (CV) : 1.7 (0.2)                     |                     |                     |        |         |
+----+------------------------------+------------------------------------------+---------------------+---------------------+--------+---------+
| 21 | log_pri_length_5000\         | Mean (sd) : 9.8 (1.1)\                   | 586 distinct values | ![](tmp/ds0121.png) | 876\   | 0\      |
|    | [numeric]                    | min < med < max:\                        |                     |                     | (100%) | (0%)    |
|    |                              | 8.5 < 10.1 < 12\                         |                     |                     |        |         |
|    |                              | IQR (CV) : 2.2 (0.1)                     |                     |                     |        |         |
+----+------------------------------+------------------------------------------+---------------------+---------------------+--------+---------+
| 22 | log_pri_length_10000\        | Mean (sd) : 10.9 (1.1)\                  | 687 distinct values | ![](tmp/ds0122.png) | 876\   | 0\      |
|    | [numeric]                    | min < med < max:\                        |                     |                     | (100%) | (0%)    |
|    |                              | 9.2 < 11.2 < 13\                         |                     |                     |        |         |
|    |                              | IQR (CV) : 2 (0.1)                       |                     |                     |        |         |
+----+------------------------------+------------------------------------------+---------------------+---------------------+--------+---------+
| 23 | log_pri_length_15000\        | Mean (sd) : 11.5 (1.1)\                  | 726 distinct values | ![](tmp/ds0123.png) | 876\   | 0\      |
|    | [numeric]                    | min < med < max:\                        |                     |                     | (100%) | (0%)    |
|    |                              | 9.6 < 11.7 < 13.6\                       |                     |                     |        |         |
|    |                              | IQR (CV) : 1.5 (0.1)                     |                     |                     |        |         |
+----+------------------------------+------------------------------------------+---------------------+---------------------+--------+---------+
| 24 | log_pri_length_25000\        | Mean (sd) : 12.2 (1.1)\                  | 787 distinct values | ![](tmp/ds0124.png) | 876\   | 0\      |
|    | [numeric]                    | min < med < max:\                        |                     |                     | (100%) | (0%)    |
|    |                              | 10.1 < 12.5 < 14.4\                      |                     |                     |        |         |
|    |                              | IQR (CV) : 1.4 (0.1)                     |                     |                     |        |         |
+----+------------------------------+------------------------------------------+---------------------+---------------------+--------+---------+
| 25 | log_prisec_length_500\       | Mean (sd) : 7 (1)\                       | 382 distinct values | ![](tmp/ds0125.png) | 876\   | 0\      |
|    | [numeric]                    | min < med < max:\                        |                     |                     | (100%) | (0%)    |
|    |                              | 6.2 < 6.2 < 9.4\                         |                     |                     |        |         |
|    |                              | IQR (CV) : 1.6 (0.1)                     |                     |                     |        |         |
+----+------------------------------+------------------------------------------+---------------------+---------------------+--------+---------+
| 26 | log_prisec_length_1000\      | Mean (sd) : 8.6 (0.8)\                   | 591 distinct values | ![](tmp/ds0126.png) | 876\   | 0\      |
|    | [numeric]                    | min < med < max:\                        |                     |                     | (100%) | (0%)    |
|    |                              | 7.6 < 8.7 < 10.5\                        |                     |                     |        |         |
|    |                              | IQR (CV) : 1.6 (0.1)                     |                     |                     |        |         |
+----+------------------------------+------------------------------------------+---------------------+---------------------+--------+---------+
| 27 | log_prisec_length_5000\      | Mean (sd) : 11.3 (0.8)\                  | 852 distinct values | ![](tmp/ds0127.png) | 876\   | 0\      |
|    | [numeric]                    | min < med < max:\                        |                     |                     | (100%) | (0%)    |
|    |                              | 8.5 < 11.4 < 12.8\                       |                     |                     |        |         |
|    |                              | IQR (CV) : 0.9 (0.1)                     |                     |                     |        |         |
+----+------------------------------+------------------------------------------+---------------------+---------------------+--------+---------+
| 28 | log_prisec_length_10000\     | Mean (sd) : 12.4 (0.7)\                  | 867 distinct values | ![](tmp/ds0128.png) | 876\   | 0\      |
|    | [numeric]                    | min < med < max:\                        |                     |                     | (100%) | (0%)    |
|    |                              | 9.2 < 12.5 < 13.8\                       |                     |                     |        |         |
|    |                              | IQR (CV) : 1 (0.1)                       |                     |                     |        |         |
+----+------------------------------+------------------------------------------+---------------------+---------------------+--------+---------+
| 29 | log_prisec_length_15000\     | Mean (sd) : 13 (0.7)\                    | 869 distinct values | ![](tmp/ds0129.png) | 876\   | 0\      |
|    | [numeric]                    | min < med < max:\                        |                     |                     | (100%) | (0%)    |
|    |                              | 9.6 < 13.1 < 14.4\                       |                     |                     |        |         |
|    |                              | IQR (CV) : 1 (0.1)                       |                     |                     |        |         |
+----+------------------------------+------------------------------------------+---------------------+---------------------+--------+---------+
| 30 | log_prisec_length_25000\     | Mean (sd) : 13.8 (0.7)\                  | 870 distinct values | ![](tmp/ds0130.png) | 876\   | 0\      |
|    | [numeric]                    | min < med < max:\                        |                     |                     | (100%) | (0%)    |
|    |                              | 10.1 < 13.9 < 15.2\                      |                     |                     |        |         |
|    |                              | IQR (CV) : 1 (0.1)                       |                     |                     |        |         |
+----+------------------------------+------------------------------------------+---------------------+---------------------+--------+---------+
| 31 | log_nei_2008_pm25_sum_10000\ | Mean (sd) : 4 (2.4)\                     | 828 distinct values | ![](tmp/ds0131.png) | 876\   | 0\      |
|    | [numeric]                    | min < med < max:\                        |                     |                     | (100%) | (0%)    |
|    |                              | 0 < 4.3 < 9.1\                           |                     |                     |        |         |
|    |                              | IQR (CV) : 3.5 (0.6)                     |                     |                     |        |         |
+----+------------------------------+------------------------------------------+---------------------+---------------------+--------+---------+
| 32 | log_nei_2008_pm25_sum_15000\ | Mean (sd) : 4.7 (2.2)\                   | 855 distinct values | ![](tmp/ds0132.png) | 876\   | 0\      |
|    | [numeric]                    | min < med < max:\                        |                     |                     | (100%) | (0%)    |
|    |                              | 0 < 5 < 9.4\                             |                     |                     |        |         |
|    |                              | IQR (CV) : 2.9 (0.5)                     |                     |                     |        |         |
+----+------------------------------+------------------------------------------+---------------------+---------------------+--------+---------+
| 33 | log_nei_2008_pm25_sum_25000\ | Mean (sd) : 5.7 (2.1)\                   | 860 distinct values | ![](tmp/ds0133.png) | 876\   | 0\      |
|    | [numeric]                    | min < med < max:\                        |                     |                     | (100%) | (0%)    |
|    |                              | 0 < 5.9 < 9.7\                           |                     |                     |        |         |
|    |                              | IQR (CV) : 2.6 (0.4)                     |                     |                     |        |         |
+----+------------------------------+------------------------------------------+---------------------+---------------------+--------+---------+
| 34 | log_nei_2008_pm10_sum_10000\ | Mean (sd) : 4.3 (2.3)\                   | 829 distinct values | ![](tmp/ds0134.png) | 876\   | 0\      |
|    | [numeric]                    | min < med < max:\                        |                     |                     | (100%) | (0%)    |
|    |                              | 0 < 4.6 < 9.3\                           |                     |                     |        |         |
|    |                              | IQR (CV) : 3.4 (0.5)                     |                     |                     |        |         |
+----+------------------------------+------------------------------------------+---------------------+---------------------+--------+---------+
| 35 | log_nei_2008_pm10_sum_15000\ | Mean (sd) : 5.1 (2.2)\                   | 855 distinct values | ![](tmp/ds0135.png) | 876\   | 0\      |
|    | [numeric]                    | min < med < max:\                        |                     |                     | (100%) | (0%)    |
|    |                              | 0 < 5.4 < 9.7\                           |                     |                     |        |         |
|    |                              | IQR (CV) : 2.8 (0.4)                     |                     |                     |        |         |
+----+------------------------------+------------------------------------------+---------------------+---------------------+--------+---------+
| 36 | log_nei_2008_pm10_sum_25000\ | Mean (sd) : 6.1 (2)\                     | 860 distinct values | ![](tmp/ds0136.png) | 876\   | 0\      |
|    | [numeric]                    | min < med < max:\                        |                     |                     | (100%) | (0%)    |
|    |                              | 0 < 6.4 < 9.9\                           |                     |                     |        |         |
|    |                              | IQR (CV) : 2.4 (0.3)                     |                     |                     |        |         |
+----+------------------------------+------------------------------------------+---------------------+---------------------+--------+---------+
| 37 | popdens_county\              | Mean (sd) : 551.8 (1711.5)\              | 564 distinct values | ![](tmp/ds0137.png) | 876\   | 0\      |
|    | [numeric]                    | min < med < max:\                        |                     |                     | (100%) | (0%)    |
|    |                              | 0.3 < 156.7 < 26821.9\                   |                     |                     |        |         |
|    |                              | IQR (CV) : 470 (3.1)                     |                     |                     |        |         |
+----+------------------------------+------------------------------------------+---------------------+---------------------+--------+---------+
| 38 | popdens_zcta\                | Mean (sd) : 1279.7 (2757.5)\             | 840 distinct values | ![](tmp/ds0138.png) | 876\   | 0\      |
|    | [numeric]                    | min < med < max:\                        |                     |                     | (100%) | (0%)    |
|    |                              | 0 < 610.3 < 30418.8\                     |                     |                     |        |         |
|    |                              | IQR (CV) : 1281.4 (2.2)                  |                     |                     |        |         |
+----+------------------------------+------------------------------------------+---------------------+---------------------+--------+---------+
| 39 | nohs\                        | Mean (sd) : 7 (7.2)\                     | 215 distinct values | ![](tmp/ds0139.png) | 876\   | 0\      |
|    | [numeric]                    | min < med < max:\                        |                     |                     | (100%) | (0%)    |
|    |                              | 0 < 5.1 < 100\                           |                     |                     |        |         |
|    |                              | IQR (CV) : 6.1 (1)                       |                     |                     |        |         |
+----+------------------------------+------------------------------------------+---------------------+---------------------+--------+---------+
| 40 | somehs\                      | Mean (sd) : 10.2 (6.2)\                  | 230 distinct values | ![](tmp/ds0140.png) | 876\   | 0\      |
|    | [numeric]                    | min < med < max:\                        |                     |                     | (100%) | (0%)    |
|    |                              | 0 < 9.4 < 72.2\                          |                     |                     |        |         |
|    |                              | IQR (CV) : 8 (0.6)                       |                     |                     |        |         |
+----+------------------------------+------------------------------------------+---------------------+---------------------+--------+---------+
| 41 | hs\                          | Mean (sd) : 30.3 (11.4)\                 | 347 distinct values | ![](tmp/ds0141.png) | 876\   | 0\      |
|    | [numeric]                    | min < med < max:\                        |                     |                     | (100%) | (0%)    |
|    |                              | 0 < 30.8 < 100\                          |                     |                     |        |         |
|    |                              | IQR (CV) : 12.3 (0.4)                    |                     |                     |        |         |
+----+------------------------------+------------------------------------------+---------------------+---------------------+--------+---------+
| 42 | somecollege\                 | Mean (sd) : 21.6 (8.6)\                  | 240 distinct values | ![](tmp/ds0142.png) | 876\   | 0\      |
|    | [numeric]                    | min < med < max:\                        |                     |                     | (100%) | (0%)    |
|    |                              | 0 < 21.3 < 100\                          |                     |                     |        |         |
|    |                              | IQR (CV) : 7.2 (0.4)                     |                     |                     |        |         |
+----+------------------------------+------------------------------------------+---------------------+---------------------+--------+---------+
| 43 | associate\                   | Mean (sd) : 7.1 (4)\                     | 157 distinct values | ![](tmp/ds0143.png) | 876\   | 0\      |
|    | [numeric]                    | min < med < max:\                        |                     |                     | (100%) | (0%)    |
|    |                              | 0 < 7.1 < 71.4\                          |                     |                     |        |         |
|    |                              | IQR (CV) : 3.9 (0.6)                     |                     |                     |        |         |
+----+------------------------------+------------------------------------------+---------------------+---------------------+--------+---------+
| 44 | bachelor\                    | Mean (sd) : 14.9 (9.7)\                  | 301 distinct values | ![](tmp/ds0144.png) | 876\   | 0\      |
|    | [numeric]                    | min < med < max:\                        |                     |                     | (100%) | (0%)    |
|    |                              | 0 < 12.9 < 100\                          |                     |                     |        |         |
|    |                              | IQR (CV) : 10.4 (0.7)                    |                     |                     |        |         |
+----+------------------------------+------------------------------------------+---------------------+---------------------+--------+---------+
| 45 | grad\                        | Mean (sd) : 8.9 (8.6)\                   | 245 distinct values | ![](tmp/ds0145.png) | 876\   | 0\      |
|    | [numeric]                    | min < med < max:\                        |                     |                     | (100%) | (0%)    |
|    |                              | 0 < 6.7 < 100\                           |                     |                     |        |         |
|    |                              | IQR (CV) : 7.1 (1)                       |                     |                     |        |         |
+----+------------------------------+------------------------------------------+---------------------+---------------------+--------+---------+
| 46 | pov\                         | Mean (sd) : 15 (11.3)\                   | 345 distinct values | ![](tmp/ds0146.png) | 876\   | 0\      |
|    | [numeric]                    | min < med < max:\                        |                     |                     | (100%) | (0%)    |
|    |                              | 0 < 12.1 < 65.9\                         |                     |                     |        |         |
|    |                              | IQR (CV) : 14.7 (0.8)                    |                     |                     |        |         |
+----+------------------------------+------------------------------------------+---------------------+---------------------+--------+---------+
| 47 | hs_orless\                   | Mean (sd) : 47.5 (16.8)\                 | 464 distinct values | ![](tmp/ds0147.png) | 876\   | 0\      |
|    | [numeric]                    | min < med < max:\                        |                     |                     | (100%) | (0%)    |
|    |                              | 0 < 48.7 < 100\                          |                     |                     |        |         |
|    |                              | IQR (CV) : 21.2 (0.4)                    |                     |                     |        |         |
+----+------------------------------+------------------------------------------+---------------------+---------------------+--------+---------+
| 48 | urc2013\                     | Mean (sd) : 2.9 (1.5)\                   | 1 : 203 (23.2%)\    | ![](tmp/ds0148.png) | 876\   | 0\      |
|    | [numeric]                    | min < med < max:\                        | 2 : 163 (18.6%)\    |                     | (100%) | (0%)    |
|    |                              | 1 < 3 < 6\                               | 3 : 228 (26.0%)\    |                     |        |         |
|    |                              | IQR (CV) : 2 (0.5)                       | 4 : 123 (14.0%)\    |                     |        |         |
|    |                              |                                          | 5 : 101 (11.5%)\    |                     |        |         |
|    |                              |                                          | 6 :  58 ( 6.6%)     |                     |        |         |
+----+------------------------------+------------------------------------------+---------------------+---------------------+--------+---------+
| 49 | urc2006\                     | Mean (sd) : 3 (1.5)\                     | 1 : 195 (22.3%)\    | ![](tmp/ds0149.png) | 876\   | 0\      |
|    | [numeric]                    | min < med < max:\                        | 2 : 162 (18.5%)\    |                     | (100%) | (0%)    |
|    |                              | 1 < 3 < 6\                               | 3 : 221 (25.2%)\    |                     |        |         |
|    |                              | IQR (CV) : 2 (0.5)                       | 4 : 127 (14.5%)\    |                     |        |         |
|    |                              |                                          | 5 : 115 (13.1%)\    |                     |        |         |
|    |                              |                                          | 6 :  56 ( 6.4%)     |                     |        |         |
+----+------------------------------+------------------------------------------+---------------------+---------------------+--------+---------+
| 50 | aod\                         | Mean (sd) : 43.7 (19.6)\                 | 581 distinct values | ![](tmp/ds0150.png) | 876\   | 0\      |
|    | [numeric]                    | min < med < max:\                        |                     |                     | (100%) | (0%)    |
|    |                              | 5 < 40.2 < 143\                          |                     |                     |        |         |
|    |                              | IQR (CV) : 18 (0.4)                      |                     |                     |        |         |
+----+------------------------------+------------------------------------------+---------------------+---------------------+--------+---------+

</details>

We can see that many variables have bimodal distributions, with one peak near zero and another at a higher value. This is true for the `imp` variables (measures of development), the `nei` variables (measures of emission sources), and the road density variables. We can also see that some variables have a very large range, in particular the area- and population-related variables.

### Evaluate correlation among possible predictors
In prediction analyses, it is also useful to evaluate whether any of the variables are correlated.

Intuitively we can expect some of our variables to be correlated.

Let's first take a look at all of our numeric variables with the `corrplot` package, which is a great option when we have many predictors.
First we need to create a correlation matrix using the `cor()` function of the `stats` package (which is loaded automatically).

```{r}
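# Correlation matrix of all numeric variables
PM_cor <- cor(pm %>% dplyr::select_if(is.numeric))
corrplot::corrplot(PM_cor, tl.cex = 0.5)

# Absolute correlations, ordered by the angular order of the eigenvectors
corrplot::corrplot(abs(PM_cor), order = "AOE", tl.cex = 0.5, cl.lim = c(0, 1))

# Upper triangle only, ordered by the first principal component;
# the "PuOr" palette comes from the RColorBrewer package
corrplot::corrplot(PM_cor, diag = FALSE, order = "FPC",
                   tl.pos = "td", tl.cex = 0.5, method = "color", type = "upper",
                   col = RColorBrewer::brewer.pal(n = 8, name = "PuOr"))

# Same plot with the default ordering and colors
corrplot::corrplot(PM_cor, diag = FALSE,
                   tl.pos = "td", tl.cex = 0.5, method = "color", type = "upper")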

```
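
With this many predictors the plot can be hard to read, so it may help to list the strongest pairs explicitly. The following is a minimal sketch using the built-in `mtcars` data rather than our `pm` data; the `toy_` object names are hypothetical and chosen only for illustration.

```{r}
# Correlation matrix for a few numeric columns of a built-in dataset
toy_cor <- cor(mtcars[, c("mpg", "disp", "hp", "wt")])

# Keep each pair once by blanking the diagonal and the lower triangle
toy_cor[lower.tri(toy_cor, diag = TRUE)] <- NA

# Reshape to a long table of variable pairs and their correlations
toy_pairs <- as.data.frame(as.table(toy_cor), stringsAsFactors = FALSE)
names(toy_pairs) <- c("var1", "var2", "r")
toy_pairs <- toy_pairs[!is.na(toy_pairs$r), ]

# Strongest absolute correlations first
toy_pairs[order(-abs(toy_pairs$r)), ]
```

The same steps could be applied to `PM_cor` to find the groups of variables discussed below.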

We can also use the `ggcorrplot` package:
```{r}
ggcorrplot(PM_cor, hc.order = TRUE, type = "lower", tl.cex = 5)
```

We can see that the development (`imp`) variables are correlated with each other, as we might expect. We also see that the road density variables seem to be correlated with each other, and the emission variables seem to be correlated with each other. We can take a closer look using the `ggcorr()` and `ggpairs()` functions of the `GGally` package. To select our variables of interest we can use the `select()` function with the `contains()` selection helper, both available from the `dplyr` package.

First let's look at the imp/development variables. 
```{r, out.width = "400px"}
select(pm, contains("imp")) %>%
  ggcorr(palette = "RdBu", label = TRUE)

select(pm, contains("imp")) %>%
  ggpairs()
  
```

Indeed, we can see that `imp_a1000` and `imp_a500` are perfectly correlated, as are `imp_a10000` and `imp_a15000`.


Now let's take a look at the road density data:

```{r, fig.width=12}
select(pm, contains("pri")) %>%
  ggcorr(palette = "RdBu",  hjust = .85, size = 3,
       layout.exp=2, label = TRUE)

```

We can see that many of the road density variables are highly correlated with one another, while others are less so.

Finally let's look at the emission variables.

```{r}
select(pm, contains("nei")) %>%
  ggcorr(palette = "RdBu",  hjust = .85, size = 3,
       layout.exp=2, label = TRUE)

select(pm, contains("nei")) %>%
  ggpairs()
```

We might also expect the population density data to correlate with some of these variables. Let's take a look.

```{r}
pm %>%
  select(log_nei_2008_pm25_sum_10000, popdens_county, log_pri_length_10000, imp_a10000) %>%
  ggcorr(palette = "RdBu", hjust = .85, size = 3,
         layout.exp = 2, label = TRUE)

pm %>%
  select(log_nei_2008_pm25_sum_10000, popdens_county, log_pri_length_10000, imp_a10000, county_pop) %>%
  ggpairs()
```


Interestingly, these variables don't appear to be highly correlated. We might therefore need variables from each of the categories to predict our monitor PM~2.5~ pollution values.

We seem to have some pretty extreme population values though, so let's see what happens when we take the log of these variables.

```{r}
pm %>%
  mutate(log_popdens_county = log(popdens_county)) %>%
  select(log_nei_2008_pm25_sum_10000, log_popdens_county, log_pri_length_10000, imp_a10000) %>%
  ggcorr(palette = "RdBu", hjust = .85, size = 3,
         layout.exp = 2, label = TRUE)

pm %>%
  mutate(log_popdens_county = log(popdens_county)) %>%
  mutate(log_pop_county = log(county_pop)) %>%
  select(log_nei_2008_pm25_sum_10000, log_popdens_county, log_pri_length_10000, imp_a10000, log_pop_county) %>%
  ggpairs()
```

Indeed, this increased the correlation, but variables from each of these categories may still prove to be useful for prediction.
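
Why would a log transform change the correlation? The Pearson correlation measures linear association, so a strongly skewed variable can appear only moderately correlated with another variable until it is transformed. Here is a minimal sketch with made-up values (the `toy_x` and `toy_y` vectors are hypothetical and unrelated to our `pm` data):

```{r}
# A hypothetical variable that grows exponentially with toy_x,
# which makes it strongly right-skewed
toy_x <- 1:50
toy_y <- exp(toy_x / 5)

# Pearson correlation on the raw scale
cor(toy_x, toy_y)

# After the log transform the relationship is linear
cor(toy_x, log(toy_y))
```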




## Data Analysis

Now that we have a sense of what our data is like we can get started with data analysis.

### The machine learning process

There are two major types of machine learning:  

1) Unsupervised  
2) Supervised  

Unsupervised learning is used to learn about the structure of the data without knowing much about the data. We let the data reveal properties about itself. Examples include clustering the data into groups or reducing the dimensionality of the data using methods like [principal component analysis](https://medium.com/@savastamirko/pca-a-linear-transformation-f8aacd4eb007){target="_blank"} (which we will describe in more detail later) to capture patterns of variance within the data.
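
As a small illustration of unsupervised learning, the sketch below runs a principal component analysis with the base `prcomp()` function on a few columns of the built-in `mtcars` data. The choice of data and the `toy_` names are assumptions made only for this example; no outcome variable is involved.

```{r}
# Principal component analysis on centered and scaled variables
toy_pca <- prcomp(mtcars[, c("mpg", "disp", "hp", "wt")],
                  center = TRUE, scale. = TRUE)

# Proportion of the total variance captured by each component
toy_var_explained <- toy_pca$sdev^2 / sum(toy_pca$sdev^2)
round(toy_var_explained, 3)
```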

```{r, out.width = "80%", echo = FALSE, fig.align ="center"}
include_graphics("https://miro.medium.com/max/1400/1*lhkCOodCMZ0-SSziEDpwpA.png")
```

#### [[source](https://towardsdatascience.com/supervised-vs-unsupervised-learning-14f68e32ea8d)]{target="_blank"}

In contrast, in supervised learning we have some knowledge about the data that we want to use to create a model to be able to generalize about other similar data.

There are two distinct goals of supervised machine learning:  

1) Prediction  
2) Classification  

```{r, out.width = "80%", echo = FALSE, fig.align ="center"}
include_graphics("https://miro.medium.com/max/1400/1*ASYpFfDh7XnreU-ygqXonw.png")
```

#### [[source](https://towardsdatascience.com/supervised-vs-unsupervised-learning-14f68e32ea8d)]{target="_blank"}

We will be performing a prediction analysis (which is also referred to as regression), which aims to predict **continuous outcome** variables given a number of predictors/explanatory variables/features/parameters, as we have already described.

Classification on the other hand aims to discern or predict group identity for a **categorical outcome** based on a number of predictors/explanatory variables/features/parameters.
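
To make the distinction concrete, here is a minimal sketch using base R and the built-in `mtcars` data (an assumption for illustration; the models used later in this case study differ). A linear model predicts a continuous outcome, while a logistic regression predicts membership in one of two groups.

```{r}
# Prediction (regression): continuous outcome, miles per gallon
toy_reg <- lm(mpg ~ wt, data = mtcars)
head(predict(toy_reg))

# Classification: categorical outcome, transmission type
# (in mtcars, am = 0 is automatic and am = 1 is manual)
toy_clf <- glm(am ~ wt, data = mtcars, family = binomial)
toy_prob <- predict(toy_clf, type = "response")
toy_class <- ifelse(toy_prob > 0.5, "manual", "automatic")

# Compare predicted groups with the observed groups
table(predicted = toy_class, observed = mtcars$am)
```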

The overall process is the same in either case and involves the following steps (which will each be explained in detail): 

1) Data exploration  

We have already performed this step to get a sense of the data. It is important to know whether we have `NA` values, to understand the class of each variable, and to determine whether there are any redundant variables that might need to be removed.

2) Data splitting 

The data needs to be split into two pieces: a training set and a testing set. The training set will be used to optimize the model, while the testing set will be used to evaluate model performance. 

3) Variable assignment and preprocessing  

Both the training and testing data need to be processed so that they are compatible with, and optimized for, the model. This involves assigning variables to specific roles within the model and preprocessing steps such as scaling variables and removing redundant variables. This process is also called feature engineering.

4) Model specification, fitting, tuning and performance evaluation using the training data

The model first needs to be fit to the training data. First, the method or algorithm with which the model will be fit is specified (regression, random forest, etc.). Then, in both classification and prediction, the model is fit to the training data and the explanatory variables are used to estimate numeric values (in the case of prediction) or categorical values (in the case of classification) of the outcome variable of interest. If the model fits well, then these estimated values will be very similar to the true outcome variable values. If the model does not fit well, then these estimates will be more dissimilar from the true outcome values. In this case, aspects of the model may need to be modified to bring the estimates closer to the true outcome values. One way to optimize model performance is a process called tuning, in which different model [hyper-parameter](https://www.datacamp.com/community/tutorials/parameter-optimization-machine-learning-models){target="_blank"} options are tested to determine the best option for model performance.

5) Overall model performance evaluation

Model performance is assessed as the similarity between the estimates of the outcome variable produced by the model and the true outcome variable values. This is typically done as an iterative process with the training data, alongside modification of the model, until the performance using the training data is satisfactory. At this point, the final model performance is assessed using the testing data. This then gives an estimate of how well the model will predict or classify the outcome variable of interest with new independent data. Ideally one would also perform an evaluation with independent data to provide a sense of how generalizable the model is to other data sources.
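
The five steps above can be sketched in a few lines of base R. This is only a rough outline under simplifying assumptions: it uses the built-in `mtcars` data, a plain linear model, and a deterministic split (every fourth row goes to testing) so that the example does not depend on random numbers, whereas our analysis will use a random split and the `tidymodels` packages. All `toy_` names are hypothetical.

```{r}
# 1) Data exploration: check for missing values, keep a few variables
stopifnot(!anyNA(mtcars))
toy_data <- mtcars[, c("mpg", "wt", "hp")]

# 2) Data splitting: every 4th row is held out for testing
toy_test_rows <- seq(4, nrow(toy_data), by = 4)
toy_train <- toy_data[-toy_test_rows, ]
toy_test  <- toy_data[toy_test_rows, ]

# 3) Preprocessing: center and scale predictors using training statistics only
toy_means <- colMeans(toy_train[, c("wt", "hp")])
toy_sds   <- apply(toy_train[, c("wt", "hp")], 2, sd)
toy_scale <- function(d) {
  d[, c("wt", "hp")] <- scale(d[, c("wt", "hp")],
                              center = toy_means, scale = toy_sds)
  d
}
toy_train_s <- toy_scale(toy_train)
toy_test_s  <- toy_scale(toy_test)

# 4) Model specification and fitting, evaluated on the training data
toy_fit  <- lm(mpg ~ wt + hp, data = toy_train_s)
toy_rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
toy_rmse(toy_train_s$mpg, predict(toy_fit, newdata = toy_train_s))

# 5) Final performance evaluation on the held-out testing data
toy_rmse(toy_test_s$mpg, predict(toy_fit, newdata = toy_test_s))
```

Note that the scaling in step 3 uses means and standard deviations computed from the training data only, so that no information from the testing data leaks into the model.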

### The tidymodels ecosystem

To perform our analysis we will be using the `tidymodels` suite of packages. You may be familiar with the older packages `caret` or `mlr` which are also for machine learning and modeling but are not a part of the `tidyverse`. [Max Kuhn](https://resources.rstudio.com/authors/max-kuhn){target="_blank"} describes `tidymodels` like this:

> "Other packages, such as caret and mlr, help to solve the R model API issue. These packages do a lot of other things too: preprocessing, model tuning, resampling, feature selection, ensembling, and so on. In the tidyverse, we strive to make our packages modular and parsnip is designed only to solve the interface issue. It is not designed to be a drop-in replacement for caret.
The tidymodels package collection, which includes parsnip, has other packages for many of these tasks, and they are designed to work together. We are working towards higher-level APIs that can replicate and extend what the current model packages can do."

There are many packages in the tidymodels ecosystem which assist with the various steps of the machine learning process:

```{r, echo=FALSE, out.width="800px"}
knitr::include_graphics(here::here("img","simpletidymodels.png"))
```


This is a depiction of how these tools help perform the overall machine learning process:

```{r, echo=FALSE, out.width="800px"}
knitr::include_graphics(here::here("img","MachineLearning.png"))
```

### The major benefits of tidymodels  

1) Standardized workflow/format/notation across different types of algorithms  

Different notations are required for different algorithms as the algorithms have been developed by many different people. Without a standard interface, testing multiple algorithms would require the painstaking process of reformatting the data to be compatible with each one.

2) Can easily modify preprocessing, algorithm choice, and hyper-parameter tuning making optimization easy  

Modifying a piece of the overall process is now much easier than before because many of the steps are specified using the `tidymodels` packages in a convenient manner. Thus the entire process can be rerun after a simple change to preprocessing without much difficulty.
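
As a brief sketch of what a standardized notation looks like, the example below uses the `parsnip` package to specify two different algorithms and then fits and predicts with identical calls. The built-in `mtcars` data, the choice of engines (`"lm"` and `"rpart"`), and the `toy_` names are assumptions made only for illustration; they are not the models used later in this case study.

```{r}
library(parsnip)

# Two different algorithms, specified with the same notation
toy_lm_spec <- linear_reg() %>%
  set_engine("lm")

toy_tree_spec <- decision_tree() %>%
  set_engine("rpart") %>%
  set_mode("regression")

# The same fit() and predict() calls work for both specifications
toy_lm_fit   <- fit(toy_lm_spec, mpg ~ wt + hp, data = mtcars)
toy_tree_fit <- fit(toy_tree_spec, mpg ~ wt + hp, data = mtcars)

predict(toy_lm_fit, new_data = head(mtcars, 3))
predict(toy_tree_fit, new_data = head(mtcars, 3))
```

Swapping the algorithm only requires changing the model specification; the data and the rest of the code stay the same.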

### Splitting the Data

The first step after data exploration in machine learning analysis is to [split the data](https://towardsdatascience.com/train-validation-and-test-sets-72cb40cba9e7){target="_blank"} into **training** and **testing** datasets. 

The training dataset will be used to build and tune our model. This is the data that the model "learns" on.

The testing set will be used to evaluate the performance of our model in a more generalizable way. What do we mean by "generalizable"?

Remember that our main goal is to use our model to predict air pollution levels in areas where there are no gravimetric monitors. If our model is very good at predicting air pollution only for the data that we used to build it, it might not do the best job in areas where there are few or no monitors. We would see really good prediction accuracy on the training data and might assume that the model would estimate air pollution well any time we use it, but in fact this would likely not be the case. This situation is what we call **[overfitting](https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6){target="_blank"}**.

Overfitting happens when we end up modeling not only the major relationships in our data but also the noise within our data. 
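
The sketch below illustrates how one might check for this. It fits a simple and a much more flexible polynomial model to half of the built-in `mtcars` data and computes the error on both halves. The data, the alternating-row split, the polynomial degrees, and the `toy_` names are all assumptions chosen for illustration. The flexible model will always fit the training data at least as well as the simple one; whether that advantage carries over to data it has not seen is exactly what the testing data is for.

```{r}
# Alternate rows between training and testing (deterministic, for illustration)
toy_train <- mtcars[seq(1, nrow(mtcars), by = 2), ]
toy_test  <- mtcars[seq(2, nrow(mtcars), by = 2), ]

# Root mean squared error of a model on a given dataset
toy_rmse <- function(model, data) {
  sqrt(mean((data$mpg - predict(model, newdata = data))^2))
}

# A simple model and a much more flexible one
toy_simple  <- lm(mpg ~ poly(hp, 1), data = toy_train)
toy_complex <- lm(mpg ~ poly(hp, 6), data = toy_train)

# Compare the error on the training data and on the held-out data
data.frame(
  model      = c("degree 1", "degree 6"),
  train_rmse = c(toy_rmse(toy_simple, toy_train), toy_rmse(toy_complex, toy_train)),
  test_rmse  = c(toy_rmse(toy_simple, toy_test), toy_rmse(toy_complex, toy_test))
)
```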


```{r}
knitr::include_graphics("https://miro.medium.com/max/1110/1*tBErXYVvTw2jSUYK7thU2A.png")
```

##### [[source](https://miro.medium.com/max/1110/1*tBErXYVvTw2jSUYK7thU2A.png)]{target="_blank"}

If we get fairly good prediction with our testing set then we will know that our model can be applied to other data and will perform fairly well. We will discuss this more later.

We will not touch the testing set until we have completed optimizing our model with the training set. This will allow us to have a less biased evaluation of how well our model can do with other data besides the data used in the training set to build the model. Ideally you would also want a completely independent dataset to further test the performance of your model.

[Here](https://machinelearningmastery.com/difference-test-validation-datasets/){target="_blank"} is a great description of the differences between testing and training datasets.

```{r, echo=FALSE, out.width="400px"}
knitr::include_graphics(here::here("img","split.png"))
```
We will use the `rsample` package to perform this step.

The `initial_split()` function allows us to specify how we want to split our data. Typically data is split into 3/4 for training and 1/4 for testing. This is the default proportion and does not need to be specified. However, you can change the proportion using the `prop` argument, which we will do here for illustrative purposes. You can also specify a variable to stratify by with the `strata` argument. This is useful if you have imbalanced categorical variables and you would like to make sure that there are similar proportions of the rarer categories in both the testing and training sets. Otherwise the split is performed randomly.

> The strata argument causes the random sampling to be conducted within the stratification variable. This can help ensure that the number of data points in the training data is equivalent to the proportions in the original data set.

In the case with our dataset, perhaps we would like our training set to have similar proportions of monitors from each of the states as in the initial data. This might be useful if we want our model to be generalizable across all of the states.

We can see that indeed there are different proportions of monitors in each state by using the `count()` function of the `dplyr` package.

#### {.scrollable }
```{r}
# Scroll through the output!
count(pm, state) %>%
  print(n = 1e3)
```
####

If our dataset were large enough it might be nice to stratify by state, but unfortunately it is not. We will still show how one would do this for illustrative purposes. This option is often more important for classification applications of machine learning than it is for regression.

Since the split is performed randomly, it is a good idea to use the `set.seed()` base function to ensure that if you rerun your code your split will be the same next time. We can see the number of monitors in our training, testing, and original data by typing in the name of our split object. The result will look like this:
`<training data sample number/testing data sample number/original sample number>`

```{r}
set.seed(1234)
pm_split <-rsample::initial_split(data = pm, prop = 2/3)
pm_split

# If stratifying:
# pm_split_strata <-rsample::initial_split(data = pm, prop = 2/3, strata = "state")

```

Importantly, the `initial_split()` function only determines which rows of our `pm` data frame should be assigned for training or testing; it does not actually split the data.

To extract the testing and training data we can use the `training()` and `testing()` functions, also of the `rsample` package.

#### {.scrollable }
```{r}
 train_pm <-rsample::training(pm_split)
 test_pm <-rsample::testing(pm_split)
 
# Scroll through the output!
count(train_pm, state)
count(test_pm, state)
```
####
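
Since we did not stratify our own split, here is a minimal sketch of what the `strata` argument does, using the built-in `iris` data as a stand-in (an assumption made purely for illustration; the object names `iris_split`, `iris_train`, and `iris_test` are hypothetical).

```{r}
# Toy illustration of a stratified split using the built-in iris data
set.seed(1234)
iris_split <- rsample::initial_split(iris, prop = 2/3, strata = "Species")

# The split object only records which rows go where;
# training() and testing() extract the actual data frames
iris_train <- rsample::training(iris_split)
iris_test  <- rsample::testing(iris_split)

# Each species makes up one third of the original data ...
table(iris$Species) / nrow(iris)
# ... and sampling within each species keeps that balance in the training set
table(iris_train$Species) / nrow(iris_train)
```

Without `strata`, the proportions in the training set would only match the original data approximately, which matters most when some categories are rare.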



### Variable Role Assignment and Preprocessing

In tidymodels we will create a recipe, which is a standardized format for a sequence of steps for processing the data.

This can be very useful because it makes testing out different preprocessing steps or different algorithms with the same preprocessing very easy and reproducible.

**Creating a recipe specifies how a data frame of predictors should be created - it specifies which variables are to be used and the preprocessing steps - but it does not execute these steps or create the data frame of predictors.**

#### List the ingredients / specify the variables with the `recipe()` function

The first thing to do to create a recipe is to specify which variables we will be using as our outcome and predictors using the `recipe()` function. In terms of the metaphor of baking, we can think of this as listing our ingredients. The naming convention for recipe object names is `*_rec` or `rec`. 

```{r, echo=FALSE, out.width="400px"}
knitr::include_graphics(here::here("img","Starting_a_recipe_recipes1.png"))
```


In our case, recall that our outcome is the `value` variable, which is the average annual gravimetric monitor PM~2.5~ concentration in ug/m^3^. Our predictors are all the other variables except the monitor ID, which is an `id` variable.

We do not include this variable as a predictor because it is made up of the county number and a number designating which of the monitors in that county the values came from. Since this number is arbitrary, the county information is also given elsewhere in the data, and each monitor only has one value in the `value` variable, nothing is gained by including this variable and it may instead introduce noise. However, it is useful to keep this data to take a look at what is happening later. We will show you what to do in this case in just a bit.

The simplest recipe, with no preprocessing steps, would simply list the outcome and predictor variables.

We can do so in two ways:  

1) Using formula notation  
2) Assigning roles to each variable  

Let's look at the first way using formula notation, which looks like this:  

outcome(s) ~ predictor(s)  

In the case of multiple predictors, or a multivariate situation with two outcomes, use a plus sign:  

outcome1 + outcome2 ~ predictor1 + predictor2  

If we want to include all predictors we can use a period like so:  

outcome_variable_name ~ .  
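
As a small illustration of the formula notation, here is a sketch using the built-in `mtcars` data rather than our monitor data (the choice of `mpg`, `wt`, and `hp` and the name `toy_formula_rec` are assumptions for the example).

```{r}
library(recipes)

# Toy recipe with one outcome and two named predictors (built-in mtcars data)
toy_formula_rec <- recipe(mpg ~ wt + hp, data = mtcars)

# summary() lists each variable along with the role it was given
summary(toy_formula_rec)
```

The variable on the left of the `~` receives the role `"outcome"` and the variables on the right receive the role `"predictor"`.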

Now with our data we will start by making a recipe for our training data. In the simplest case we might use all predictors like this:

```{r}

simple_rec <-train_pm %>%
  recipes::recipe(value ~ .)

simple_rec
```


However, to deal with the `id` variable we could use the `update_role()` function of the `recipes` package. This option works well with the newer `workflows` package. In analyses that do not use this newer package, ID variables are often dropped instead, because they can make the process difficult when using the `parsnip` package alone: new levels (or possible values) may be introduced with the testing data.

```{r}

simple_rec <-train_pm %>%
  recipes::recipe(value ~ .) %>%
  recipes::update_role(id, new_role = "id variable")

simple_rec
```

We could also specify the outcome and predictors in the same way as the id variable. Please see [here](https://tidymodels.github.io/recipes/reference/recipe.html) for examples of other roles for variables. The role can actually be any value.

The order is important here, as we first make all variables predictors and then override this role for the outcome and id variable. We will use the `everything()` function of the `dplyr` package to start with all of the variables in `train_pm`.

```{r}

simple_rec <-recipe(train_pm) %>%
    update_role(everything(), new_role = "predictor")%>%
    update_role(value, new_role = "outcome")%>%
    update_role(id, new_role = "id variable")

simple_rec

```

If we want to take a look at the formula from our recipe, we can use the `formula()` function of the `stats` package.

```{r}
formula(simple_rec)
```

We can also view our recipe in more detail using the base `summary()` function.

```{r}
summary(simple_rec)
```

#### List the preprocessing steps using the step functions of the `recipes` package

The other thing the `recipes` package allows for is specifying preprocessing steps using a variety of `step_*()` functions.

```{r, echo=FALSE, out.width="400px"}
knitr::include_graphics(here::here("img","Making_a_recipe_recipes2.png"))
```


**This [link](https://tidymodels.github.io/recipes/reference/index.html){target="_blank"} and this [link](https://cran.r-project.org/web/packages/recipes/recipes.pdf){target="_blank"} show the many options for recipe step functions.**

<u>There are step functions for a variety of purposes:</u>

1) [**Imputation**](https://en.wikipedia.org/wiki/Imputation_(statistics)){target="_blank"}  -- which means filling in missing values based on the existing data 
2) [**Transformation**](https://en.wikipedia.org/wiki/Data_transformation_(statistics)){target="_blank"}  -- which means changing all values of a variable in the same way, typically to make it more normal or easier to interpret  
3) [**Discretization**](https://en.wikipedia.org/wiki/Discretization_of_continuous_features) -- which means converting continuous values into discrete or nominal values (binning, for example, to reduce the number of possible levels). However, this is generally not advisable!
4) [**Encoding / Creating Dummy Variables**](https://en.wikipedia.org/wiki/Dummy_variable_(statistics)) -- which means creating a numeric code for categorical variables
[**More on Dummy Variables and one hot encoding**](https://medium.com/p/b5840be3c41a/responses/show)
5) [**Data type conversions**](https://cran.r-project.org/web/packages/hablar/vignettes/convert.html) -- which means changing from integer to factor or numeric to date etc.
6) [**Interaction**](https://statisticsbyjim.com/regression/interaction-effects/) term addition to the model -- which means that we would be modeling for predictors that would influence the capacity of each other to predict the outcome
7) [**Normalization**](https://en.wikipedia.org/wiki/Normalization_(statistics)) -- which means centering and scaling the data to a similar range of values
8) [**Dimensionality Reduction/ Signal Extraction**](https://en.wikipedia.org/wiki/Dimensionality_reduction) -- which means mathematically obtaining a new smaller set of variables that capture the variation or signal in the original variables (ex. Principal Component Analysis and Independent Component Analysis)
9) **Filtering** -- Filtering options for removing variables (ex. remove variables that are highly correlated to others or remove variables with very little variance and therefore likely little predictive capacity)
10) [**Row operations**](https://tartarus.org/gareth/maths/Linear_Algebra/row_operations.pdf) -- which means performing functions on the values within the rows  (ex. rearranging, filtering, imputing)
11) **Checking functions** -- Sanity checks to look for missing values, to look at the variable classes etc.

All of the step functions look like `step_*` except for the check functions which look like `check_*`.

There are several ways to select what variables to apply steps to:  
1) tidyselect methods: `contains()`, `matches()`, `starts_with()`, `ends_with()`, `everything()`, `num_range()`  
2) based on the type: `all_nominal()`, `all_numeric()` , `has_type()` 
3) based on the role: `all_predictors()`, `all_outcomes()`, `has_role()`
4) name - use the actual name of the variable/variables of interest  
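
The selection methods listed above can be combined within one recipe. The following is a minimal sketch on the built-in `iris` data; the particular steps chosen here (`step_center()`, `step_scale()`, `step_dummy()`) and the object names are only for illustration and are not the steps we will use for our monitor data.

```{r}
library(recipes)

toy_select_rec <- recipe(Sepal.Length ~ ., data = iris) %>%
  # by type and role: center every numeric variable except the outcome
  step_center(all_numeric(), -all_outcomes()) %>%
  # by name pattern (tidyselect): additionally scale only the Petal measurements
  step_scale(starts_with("Petal")) %>%
  # by type: create dummy variables for the categorical variable
  step_dummy(all_nominal())

toy_selected <- juice(prep(toy_select_rec, retain = TRUE))
names(toy_selected)
```

A minus sign in front of a selector, as in `-all_outcomes()`, excludes those variables from the step.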


Let's try adding some steps to our recipe.

We might consider log transforming our population and area variables (that aren't densities) - let's take a look at the range of these variables.
```{r, eval = FALSE}
pm %>%
  select(matches("_pop|_area")) %>%
  map(range)
```
We can see that the range for each of these variables is quite large. We could log transform these data using the `step_log()` function of the `recipes` package.
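
As a quick sketch of what a log transformation step does, here is a toy example with made-up population counts (the data frame `toy_pop` and the choice of base 10 are assumptions for illustration).

```{r}
library(recipes)

# Hypothetical population counts spanning several orders of magnitude
toy_pop <- data.frame(pop = c(10, 1000, 100000))

toy_log_rec <- recipe(~ pop, data = toy_pop) %>%
  step_log(pop, base = 10)

# After the log10 transformation the values are 1, 3 and 5
toy_logged <- juice(prep(toy_log_rec, retain = TRUE))
toy_logged
```

Values that originally differed by a factor of 10,000 now sit on a compact scale, which can make relationships with the outcome easier to model.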

We might also want to [one hot encode](https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/) some of our categorical variables so that they can be used with certain algorithms. We can do this with the `step_dummy()` function and the `one_hot = TRUE` argument. One hot encoding means that we don't simply encode our categorical variables as a single numeric code, since such numeric assignments can be interpreted by algorithms as having a particular rank or order. Instead, a separate binary variable made of 1s and 0s is created for each category, so that no order is implied.
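
Here is a minimal sketch of one hot encoding on a made-up data frame (`toy_colors` and its values are hypothetical and only serve to show the shape of the result).

```{r}
library(recipes)

toy_colors <- data.frame(
  color = factor(c("red", "green", "blue", "green")),
  y     = c(1.2, 3.4, 2.2, 3.1)
)

toy_onehot_rec <- recipe(y ~ color, data = toy_colors) %>%
  step_dummy(color, one_hot = TRUE)

# One 0/1 column per category; each row has a single 1
toy_onehot <- juice(prep(toy_onehot_rec, retain = TRUE))
toy_onehot
```

With the default `one_hot = FALSE`, one of the categories would instead be dropped and treated as the reference level.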

Our fips variable includes a numeric code for state and county - and therefore is essentially a proxy for county.  Since we already have county, we will just use it and keep the fips id as another ID variable.

We can remove the `fips` variable from the predictors using `update_role()` to make sure that the role is no longer `"predictor"`. We can make the role anything we want actually, so we will keep it something identifiable.

We might also want to remove variables that appear to be redundant and are highly correlated with others, as we know from our exploratory data analysis that many of our variables are correlated with one another. We can do this using the `step_corr()` function.

It is also a good idea to remove variables with near-zero variance, which can be done with the `step_nzv()` function. Variables have low variance if all the values are very similar, the values are very sparse, or if they are highly imbalanced. 

Examples where you might have near-zero variance variables include: 

1) **Similar Values** - If the population density was nearly the same for every zcta that contained a monitor, then knowing the population density near our monitor would contribute little to our model in assisting us to predict monitor air pollution values. 
2) **Sparse Data** - If all of the monitors were in locations where the populations did not attend graduate school, then these values would mostly be zero. Again, this would do very little to help us distinguish our air pollution monitors. When many of the values are zero this is also called sparse data.  
3) **Imbalanced Data** - If nearly all of the monitors were located in one particular state, and all the others only had one monitor each, then the real predictive value would simply be in knowing if a monitor is located in that particular state or not. In this case we don't want to remove our variable, we just want to simplify it.

See this [blog post](https://www.r-bloggers.com/near-zero-variance-predictors-should-we-remove-them/) about why removing near-zero variance variables isn't always a good idea if we think that a variable might be especially informative.
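
To see what these two filtering steps do in isolation, here is a small sketch on a made-up data frame. All of the variable names and values in `toy_filter` are hypothetical; the steps use their default thresholds.

```{r}
library(recipes)

toy_filter <- data.frame(
  y        = seq(2, 80, by = 2) + rep(c(0.5, -0.5), 20),
  x1       = 1:40,
  x1_twice = 2 * (1:40) + rep(c(0.1, -0.1), 20),  # almost perfectly correlated with x1
  x_cycle  = rep(c(1, 5, 2, 4), 10),              # not strongly correlated with the others
  x_sparse = c(rep(0, 39), 1)                     # nearly all zeros
)

toy_filter_rec <- recipe(y ~ ., data = toy_filter) %>%
  # drop one variable from each highly correlated pair of predictors
  step_corr(all_predictors()) %>%
  # drop predictors with near-zero variance
  step_nzv(all_predictors())

toy_filtered <- juice(prep(toy_filter_rec, retain = TRUE))
names(toy_filtered)
```

Only one of `x1` and `x1_twice` should remain, and `x_sparse` should be removed because nearly all of its values are identical.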

**It is important to add the steps to the recipe in an order that makes sense just like with a cooking recipe.**

Thus, first we are going to create numeric values for our categorical variables, then we will look at correlation and near-zero variance. We don't want to remove some of our variables, like the CMAQ and aod variables, so we make sure they are kept in the model by excluding them from those steps. If we specifically wanted to remove a predictor we could use `step_rm()`.

```{r}
simple_rec %<>%
  update_role("fips", new_role = "county id") %>%
# create numeric dummy variables to encode for categorical variables
  step_dummy(state, county, city, zcta, one_hot = TRUE) %>%
  # identify and remove all correlated predictors (now that they are numeric)
  step_corr(all_predictors(), - CMAQ, - aod)%>%
  # identify variables with near zero variance and remove
  step_nzv(all_predictors(), - CMAQ, - aod)
  
simple_rec

```



### Running the preprocessing

The next major function of the `recipes` package is `prep()`.

This function updates the recipe object based on the training data. It estimates the parameters (the quantities and statistics required by the steps for the variables) for preprocessing and updates the model terms, as some of the predictors may be removed. This allows the recipe to be ready to use on other datasets. It does not necessarily execute the preprocessing itself; however, we will specify an argument for it to do this so that we can take a look at the preprocessed data.

There are some important arguments to know about:
1) `training` - you must supply a training data set to estimate parameters for preprocessing operations (recipe steps) - this may already be included in your recipe - as is the case for us
2) `fresh` - if `TRUE`, will retrain and estimate parameters for any previous steps that were already prepped if you add more steps to the recipe
3) `verbose` - if `TRUE`, shows the progress as the steps are evaluated and the size of the preprocessed training set
4) `retain` - if `TRUE`, then the preprocessed training set will be saved within the recipe (as `template`). This is good if you are likely to add more steps and don't want to rerun `prep()` on the previous steps. However, this can make the recipe size large. This is necessary if you want to actually look at the preprocessed data.


```{r}
prepped_rec <- prep(simple_rec, verbose = TRUE, retain = TRUE )
names(prepped_rec)
```

There are also lots of useful things to check out in the output of `prep()`.
You can see:
1) the `steps` that were run  
2) the variable info (`var_info`)  
3) the model `term_info`
4) the new `levels` of the variables 
5) the original levels of the variables `orig_lvls`   
6) info about the training data set size and completeness (`tr_info`)

Note:  You may see the `prep.recipe()` function in material that you read about the `recipes` package. This is referring to the `prep()` function of the `recipes` package.

#### Extracting the preprocessed training data

```{r, echo=FALSE, out.width="400px"}
knitr::include_graphics(here::here("img","training_preprocessing_recipes3.png"))
```

Since we retained our preprocessed training data, we can take a look at it by using the `juice()` function of the `recipes` package like this:

#### {.scrollable }
```{r}
# Scroll through the output!
juiced_train<- juice(prepped_rec)
glimpse(juiced_train)
```
####


For easy comparison, here is our original data:

#### {.scrollable }

```{r}
# Scroll through the output!
glimpse(pm)
```
####

Notice how we only have 36 variables now instead of 50! Two of these are our ID variables (`fips` and the actual monitor ID (`id`)) and one is our outcome (`value`). Thus we only have 33 predictors now. We can also see that we no longer have any categorical variables. Variables like `state` are gone and only `state_California` remains, as it was the only state indicator that did not have near-zero variance. We can see that California had the largest number of monitors compared to the other states. We can also see that there were more monitors listed as `"Not in a city"` than in any city.

#### {.scrollable }

```{r}
#Scroll through the output!
pm %>% count(state)  %>%
  print(n = 1e3)
```

####

#### {.scrollable }

```{r}
#Scroll through the output!
pm %>% count(city) %>%
  print(n = 1e3)
```

####

**Note**:  Recall that you must specify `retain = TRUE` argument of the `prep()` function to use `juice()`.

#### Extracting the preprocessed testing data

```{r, echo=FALSE, out.width="400px"}
knitr::include_graphics(here::here("img","testing_preprocessing_recipes4.png"))
```

According to the tidymodels documentation:

> `bake()` takes a trained recipe and applies the operations to a data set to create a design matrix.
> For example: it applies the centering to new data sets using these means used to create the recipe.


If you wanted to look at the preprocessed testing data you would use the `bake()` function of the `recipes` package.
(You generally want to leave your testing data alone, but it is good to look for issues like the introduction of NA values).
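
The centering example from the documentation can be sketched as follows. This uses the built-in `mtcars` data with an arbitrary split into "training" and "testing" rows (both are assumptions made for illustration, as are the object names).

```{r}
library(recipes)

toy_center_train <- mtcars[1:20, ]
toy_center_test  <- mtcars[21:32, ]

# prep() estimates the mean of wt from the training rows only
toy_center_rec <- recipe(mpg ~ wt, data = toy_center_train) %>%
  step_center(wt) %>%
  prep(retain = TRUE)

# bake() subtracts that *training* mean from the new data
toy_center_baked <- bake(toy_center_rec, new_data = toy_center_test)

# TRUE if the baked values equal the test values minus the training mean
all.equal(toy_center_baked$wt, toy_center_test$wt - mean(toy_center_train$wt))
```

This is why the testing data never influences the preprocessing parameters: everything is estimated during `prep()` and only applied during `bake()`.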

#### {.scrollable }
```{r}
# Scroll through the output!
baked_test_pm <- recipes::bake(prepped_rec, new_data = test_pm)
glimpse(baked_test_pm)
```
####


Notice that our `city_Not.in.a.city` variable seems to be NA values. Why might that be?

Ah! Perhaps it is because some of our levels were not previously seen in the training set!

Let's take a look using the [set operations](https://www.probabilitycourse.com/chapter1/1_2_2_set_operations.php) of the `dplyr` package. We can take a look at cities that were different between the test and training set.

```{r}
traincities <- train_pm %>% distinct(city)
testcities <- test_pm %>% distinct(city)

#get the number of cities in the test set that were not in the training set
dim(dplyr::setdiff(testcities, traincities))

#get the number of cities that overlapped
dim(dplyr::intersect(traincities, testcities))
```

Indeed, there are lots of different cities in our test data that are not in our training data!

One way to handle this would be to update our original recipe to include a very important step function called `step_novel()`. This helps in cases like this where there are new levels in our testing set that were not in our training set. It is a good idea to include this in most of your recipes where you have a categorical variable with many distinct values. This step needs to come before we create dummy variables. However, since we are also creating dummy variables from `city`, this alone would still leave us with a problem.
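
For reference, here is a minimal sketch of how `step_novel()` behaves on a made-up example (the data frames `toy_city_train` and `toy_city_test` are hypothetical). It is not the approach we take for our monitor data.

```{r}
library(recipes)

toy_city_train <- data.frame(
  city = factor(c("A", "B", "A", "B")),
  y    = c(1, 2, 3, 4)
)
# The test data contains a city ("C") that never appears in the training data.
# It is stored as a character so the unseen value is not lost before the recipe runs.
toy_city_test <- data.frame(city = c("A", "C"), y = c(5, 6), stringsAsFactors = FALSE)

toy_novel_rec <- recipe(y ~ city, data = toy_city_train) %>%
  step_novel(city) %>%                 # must come before step_dummy()
  step_dummy(city, one_hot = TRUE) %>%
  prep(retain = TRUE)

# The unseen city is assigned to the "new" level instead of becoming NA
toy_novel_baked <- bake(toy_novel_rec, new_data = toy_city_test)
toy_novel_baked
```

Note that the resulting `city_new` column contains only zeros in the training data, so it carries no information for fitting a model. This is one reason why `step_novel()` on its own does not fully solve our problem.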


Let's modify the city variable to be values of `in a city` or `not in a city` using the `if_else()` function of `dplyr`. Alternatively you could create a [custom step function](https://recipes.tidymodels.org/articles/Custom_Steps.html) to do this and add the step function to your recipe, but that is beyond the scope of this case study. 

We need to create a new recipe to move forward, as the levels of our variables are established when the recipe is created. We would also potentially have this issue for `state` and `county`, so let's also do a similar thing for `state`. The `county` variable appears to get dropped due to either correlation or near-zero variance. It is likely due to near-zero variance because this is the more granular of these geographic categorical variables and is likely sparse.


```{r}

pm[["city"]]<-if_else(
  pm[["city"]] == "Not in a city", 
  true = "Not in a city", false = "In a city")

pm[["state"]] <- if_else(
  pm[["state"]] == "California", 
  true = "California", false = "Not California")

set.seed(1234) # same seed as before

pm_split <-rsample::initial_split(data = pm, prop = 2/3)
pm_split
 train_pm <-rsample::training(pm_split)
 test_pm <-rsample::testing(pm_split)

  
novel_rec <-recipe(train_pm) %>%
    update_role(everything(), new_role = "predictor")%>%
    update_role(value, new_role = "outcome")%>%
    update_role(id, new_role = "id variable")%>%
    update_role("fips", new_role = "county id")%>%
    # create numeric dummy variables to encode for categorical variables
    step_dummy(state, county, city, zcta, one_hot = TRUE) %>%
    # identify and remove all correlated predictors (now that they are numeric)
    step_corr(all_numeric()) %>%
    # identify variables with near zero variance and remove
    step_nzv(all_numeric())
 
```

Now let's retrain our training data and try baking our test data:

```{r}
prepped_rec <- prep(novel_rec, verbose = TRUE, retain = TRUE)
```


#### {.scrollable }
```{r}
# Scroll through the output!
juiced_train<- juice(prepped_rec)
glimpse(juiced_train)
```

####

Notice, it looks like we gained the `log_prisec_length_25000` back with this recipe using the data with our changes to `state` and `city`.

#### {.scrollable }

```{r}
# Scroll through the output!
baked_test_pm<- recipes::bake(prepped_rec, new_data = test_pm)
glimpse(baked_test_pm)
```

####

Great, now we no longer have NA values! :)


Note: if you use the skip option for some of the preprocessing steps, be careful. `juice()` will show all of the results ignoring `skip = TRUE`. `bake()` will not necessarily conduct these steps on the new data. 


### Specifying the Model

So far we have used `rsample` to split the data and `recipes` to assign variable roles and to specify and prep our preprocessing (as well as to optionally extract the preprocessed data).

We will now use the `parsnip` package (which is similar to the earlier `caret` package, hence the vegetable name) to specify our model.

There are four aspects to define about our model:  
1) the **type** of model (using specific functions in parsnip like `rand_forest()`, `logistic_reg()` etc.)  
2) the **mode** of learning - classification or regression (using the `set_mode()` function)  
3) the package or **engine** that we will use to implement the type of model selected (using the `set_engine()` function)  
4) any **arguments** necessary for the model/package selected (using the `set_args()`function -  for example the `mtry =` argument for random forest which is the number of variables to be used as options for splitting at each tree node)  
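
The four aspects can be put together as in the following sketch. The random forest type, the `"ranger"` engine, and the argument values are chosen only to illustrate the pattern (they are assumptions, not the model we start with below), and the name `toy_rf_spec` is hypothetical.

```{r}
library(parsnip)

toy_rf_spec <- rand_forest() %>%     # 1) the type of model
  set_mode("regression") %>%         # 2) the mode of learning
  set_engine("ranger") %>%           # 3) the engine (package) that implements it
  set_args(mtry = 3, trees = 500)    # 4) arguments for the model

toy_rf_spec
```

Nothing is fit at this point; the specification only describes the model, in the same way that a recipe only describes the preprocessing.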

We are going to start our analysis with a linear regression but we will demonstrate how we can try different models.

The first thing we do is define what type of model we would like to use. See [here](https://tidymodels.github.io/parsnip/articles/articles/Models.html){target="_blank"} for modeling options in `parsnip`.

```{r}
PM_model <- parsnip::linear_reg() # PM for particulate matter
PM_model
```

OK. So far all we have told `parsnip` is we want to use a linear regression...  Let's tell `parsnip` more about what we want.

We would like to use the ordinary least squares method to fit our linear regression. So we will tell `parsnip` that we want to use the `lm()` function (from the base R `stats` package) as the engine to implement our linear regression (there are many options actually, such as [`rstan`](https://cran.r-project.org/web/packages/rstan/vignettes/rstan.html), [`glmnet`](https://cran.r-project.org/web/packages/glmnet/index.html), [`keras`](https://keras.rstudio.com/), and [`sparklyr`](https://therinspark.com/starting.html#starting-sparklyr-hello-world)). We will do so by using the `set_engine()` function of the `parsnip` package.

```{r}
lm_PM_model <- 
  PM_model  %>%
  parsnip::set_engine("lm")

lm_PM_model

```

In some cases a model can do either classification or regression, so it is a good idea to specify which mode you intend to perform. You can do this with the `set_mode()` function of the `parsnip` package, by using either `set_mode("classification")` or `set_mode("regression")`.

```{r}
lm_PM_model <- 
  PM_model  %>%
  parsnip::set_engine("lm") %>%
  set_mode("regression")

lm_PM_model

```

### Fitting the Model: two ways - `workflows` and `parsnip`

To fit our model we can use the `parsnip` package and then assess our fit using the `yardstick` package.

However a newer package called `workflows` allows us to keep track of both our preprocessing steps and our model specification. It also allows us to implement fancier optimizations in an automated way and it is currently being developed to also handle post-processing operations, so it is good to learn about it!

So we will now create a workflow with the recipe (our preprocessing specifications) that we made and the model that we just specified.

First we use the `workflow()` function of the `workflows` package to create a workflow.

Then we add our recipe with the `add_recipe()` function and we add our model with the `add_model()` function of the `workflows` package. 

Note: We do not need to actually prep our recipe before using workflows!

```{r}
PM_wflow <-workflows::workflow() %>%
           workflows::add_recipe(novel_rec) %>%
           workflows::add_model(lm_PM_model)
PM_wflow
```

Ah, nice. Notice how it tells us about both our preprocessing steps and our model specifications.

Now we can prepare the recipe (estimate the parameters) and fit the model to our training data all at once. Printing the output we can see the coefficients of the model.

```{r}
PM_wflow_fit <- parsnip::fit(PM_wflow, data = train_pm)
PM_wflow_fit
```

Otherwise we could have done this without the `workflows` package. Notice that here we will use the preprocessed training data (`juiced_train`) as opposed to the raw training data that we used with the workflow we created with `workflows`.

In this case, we actually need to write our model formula again! Recall that `id` and `fips` are ID variables and that `value` is our outcome of interest (the PM~2.5~ air pollution measure at each monitor).

```{r}
juiced_train_ready <- juiced_train %>% select(-id, -fips)
PM_fit <- lm_PM_model %>%
  parsnip::fit(value ~ ., data = juiced_train_ready)

```

### Looking at model fit with `broom`

The `broom` package allows for an easy/tidy way to look at the fitted model:  

`tidy()` grabs the coefficients from the model  
`glance()` summarizes the model fit and gives us an idea about how well the model might perform  
`augment()` gives a 150 row observation level summary of the data and fit 

These `broom` functions currently only work with `parsnip` objects not raw `workflows` objects. To use the `tidy()` function with `workflows` we need to first use the `pull_workflow_fit()` function.

```{r}
broom::tidy(PM_fit) %>% arrange(p.value)
broom::glance(PM_fit[["fit"]])
broom::augment(PM_fit[["fit"]]) # this also gives us the fitted values, standard error for each and more!


wflowoutput<-PM_wflow_fit %>% 
  pull_workflow_fit() %>% 
  broom::tidy() 


#The output is identical using workflows to fit the model or just parsnip
identical(tidy(PM_fit), wflowoutput)

```


OK, so we have fit our model on our training data, which means we have created a model to predict values of air pollution based on the predictors that we have included. Yay!


We can get a sense of the variable importance using the `vip()` function of the `vip` package. 

Let's take a look at the top 10 contributing variables:

```{r}
PM_wflow_fit %>% 
  pull_workflow_fit() %>% 
  vip(num_features = 10)
```

### Model Performance

Let's take a look at how well our model fit our training data:


```{r}
##using the parsnip version
#can grab the fitted values simply like this:
parsnip_fitted_values<-fitted(PM_fit[["fit"]])

head(parsnip_fitted_values)

#can also get the values using the augment function
parsnip_fitted_values <- augment(PM_fit[["fit"]], data = juiced_train) %>% 
select(value, .fitted:.std.resid)

head(parsnip_fitted_values)

##using the workflows version
wf_fit <-PM_wflow_fit %>% 
  pull_workflow_fit()

#can grab the fitted values simply like this:
wf_fitted_values<-fitted(wf_fit[["fit"]])

head(wf_fitted_values)

#can also get the values using the augment function
wf_fitted_values <-augment(wf_fit[["fit"]], data = juiced_train) %>% 
select(value, .fitted:.std.resid)

head(wf_fitted_values)

#gives us the same fitted values
identical(parsnip_fitted_values, wf_fitted_values)

## Let's make a plot of fitted and real values

ggplot(wf_fitted_values, aes(x = .fitted, y = value)) + geom_point()
```

OK, so our fitted range appears to be smaller than the real values. We could probably do a bit better.


Let's take a look at how well our model seems to be performing more formally:

When assessing the performance of a model, the metrics we use depend on whether we are performing a classification or a regression (prediction of a continuous outcome) analysis. In our case we are performing a regression analysis and the metrics often used are:
1) mean absolute error (mae)  
2) R squared (rsq) - this is also known as the coefficient of determination, which `yardstick` calculates as the squared correlation between truth and estimate  
3) root mean squared error (rmse)   
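
To make these three definitions concrete, here is a sketch that computes them by hand in base R on a few made-up numbers (the vectors `toy_truth` and `toy_estimate` are hypothetical).

```{r}
# Toy observed and predicted values (made-up numbers)
toy_truth    <- c(10, 12, 9, 14)
toy_estimate <- c(11, 11, 10, 12)

toy_errors <- toy_truth - toy_estimate

toy_mae        <- mean(abs(toy_errors))          # mean absolute error: 1.25
toy_rmse_value <- sqrt(mean(toy_errors^2))       # root mean squared error: sqrt(1.75), about 1.32
toy_rsq        <- cor(toy_truth, toy_estimate)^2 # squared correlation: 25/29.5, about 0.85

c(mae = toy_mae, rmse = toy_rmse_value, rsq = toy_rsq)
```

Because the errors are squared before averaging, rmse penalizes the single larger error (2) more heavily than mae does, which is why it is the larger of the two here.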


We can use the `yardstick` package to quickly calculate estimates for all of these values using the `metrics()` function. Alternatively if you only wanted one metric you could use the `mae()`, `rsq()`, or `rmse()` functions respectively. This is helpful to examine with our fitted training set values to see how well our model is performing and if we need to make adjustments. 

```{r}

yardstick::metrics(wf_fitted_values, truth = value, estimate = .fitted)
yardstick::mae(wf_fitted_values, truth = value, estimate = .fitted)

```


### Cross validation sample splitting

We will use the `rsample` package again in order to further implement what are called [cross validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics)){target="_blank"} techniques. This is also called **resampling** or **repartitioning**.  

Note: we are not actually getting new samples from the underlying distribution so the term resampling is a bit of a misnomer.

[Cross validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics)){target="_blank"} splits our training data into multiple training data sets to allow for a deeper assessment of the accuracy of the model.

Here is a visualization of the concept for cross validation/resampling/repartitioning from [Max Kuhn](https://resources.rstudio.com/authors/max-kuhn){target="_blank"}:

```{r, echo=FALSE, out.width="800px"}
knitr::include_graphics(here::here("img","resampling.png"))
```

Technically creating our testing and training set out of our original training data is sometimes considered a form of [cross validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics)){target="_blank"} called the holdout method. As we just learned this can give us a better sense of the accuracy of our data in a more generalizable way. 

However, we can do a better job of optimizing our model for accuracy if we also perform another type of [cross validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics)){target="_blank"} on the newly defined training set that we just created. There are many [cross validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics)){target="_blank"} methods and most can be easily implemented using `rsamples` package. We will use a very popular method called either [k-fold or v-fold cross validation](https://machinelearningmastery.com/k-fold-cross-validation/){target="_blank"}. 

This method essentially involves performing the holdout method iteratively with the training data. 

First the training set is divided into k or v equally sized smaller pieces. 

Then the model is trained on k-1 or v-1 of these subsets and evaluated on the remaining subset. This is repeated iteratively, holding out a different subset each time, until every subset has been held out once, to get a sense of the performance of the model. This is really useful for fine tuning specific aspects of the model in a process called model tuning.
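The procedure described above can be sketched by hand in base R. This is only a minimal illustration, assuming the built-in `mtcars` data and a simple `lm()` model in place of the case study data and workflow; the object names (`fold_id`, `fold_rmse`) are made up for this example.

```r
# v-fold cross validation by hand on the built-in mtcars data
# (illustrative only; the case study itself uses rsample on train_pm)
set.seed(1234)
v <- 4
n <- nrow(mtcars)

# Randomly assign each row to one of v roughly equal folds
fold_id <- sample(rep(seq_len(v), length.out = n))

fold_rmse <- sapply(seq_len(v), function(k) {
  analysis_set   <- mtcars[fold_id != k, ]  # v - 1 folds used to fit
  assessment_set <- mtcars[fold_id == k, ]  # held-out fold used to evaluate
  fit  <- lm(mpg ~ wt + hp, data = analysis_set)
  pred <- predict(fit, newdata = assessment_set)
  sqrt(mean((assessment_set$mpg - pred)^2))
})

fold_rmse        # one RMSE per held-out fold
mean(fold_rmse)  # cross validated estimate of performance
```

Every row is used for evaluation exactly once, and the mean across folds gives a more stable estimate than a single holdout split.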


Here is a visualization of how the folds are created:

```{r, echo=FALSE, out.width="800px"}
knitr::include_graphics(here::here("img","vfold.png"))
```


Note: In the air pollution field, people typically ignore spatial dependence when cross validating monitoring data, so we will do the same. However, it might make sense to leave out blocks of monitors rather than random individual monitors to help account for some spatial dependence.

The [`vfold_cv()`](https://tidymodels.github.io/rsample/reference/vfold_cv.html){target="_blank"} function of the `rsample` package can be used to parse the training data into folds for k-fold/v-fold cross validation.

The `v` argument specifies the number of folds to create.  
The `repeats` argument specifies how many times the v-fold partitioning should be repeated - the default is 1.  
The `strata` argument specifies a variable to stratify samples across folds (just like in `initial_split()`).  
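As a small sketch of how these arguments behave, the example below uses the built-in `mtcars` data rather than `train_pm`, and the object names (`folds`, `first_split`) are made up for illustration. It assumes the `rsample` package is installed.

```r
library(rsample)

set.seed(1234)
# 4 folds, with the whole partitioning repeated twice -> 8 resamples
folds <- vfold_cv(mtcars, v = 4, repeats = 2)
folds

first_split <- folds$splits[[1]]
dim(analysis(first_split))    # rows used for fitting in this resample
dim(assessment(first_split))  # rows held out in this resample
```

Within each resample, the analysis and assessment sets together make up the full data that was passed to `vfold_cv()`.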

Again, because these are created at random, we need to use the base `set.seed()` function in order to obtain the same results each time we knit this document. Generally speaking, using 10 folds is good practice, but this depends on the variability within your data. We are going to use 4 for the sake of expediency. 

```{r}
set.seed(1234)

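vfold_pm <- rsample::vfold_cv(data = train_pm, v = 4)

vfold_pm

# look at the first two splits; indexing by position works whether or not
# the list of splits is named
vfold_pm$splits[[1]]
vfold_pm$splits[[2]]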
```

Once the folds are created they can be used to evaluate performance by fitting the model to each of the resamples that we created:


```{r, echo=FALSE, out.width="800px"}
knitr::include_graphics(here::here("img","cross_validation.png"))
```


We can fit the model to our cross validation folds using the `fit_resamples()` function of the `tune` package, by specifying our workflow object and the cross validation fold object we just created. See [here](https://tidymodels.github.io/tune/reference/fit_resamples.html) for more information.

```{r}
# save the out-of-sample predictions for each fold
control <- tune::control_resamples(save_pred = TRUE)
resample_fit <- tune::fit_resamples(PM_wflow, vfold_pm, control = control)

```

We can now take a look at various metrics of performance based on the fit of our cross validation "resamples". To do this we will use the `show_best()` function of the `tune` package.

```{r}
# show_best() takes one metric at a time
tune::show_best(resample_fit, metric = "rmse")
tune::show_best(resample_fit, metric = "rsq")
```


### Tuning

Now let's try some tuning.

Let's take a closer look at how the air pollution monitor values vary with the location latitude and longitude.

```{r}

train_pm %>% 
  dplyr::select(value, lon, lat) %>% 
  tidyr::pivot_longer(cols = c(lon, lat), 
                      names_to = "predictor", values_to = "loc_value") %>% 
  ggplot(aes(x = loc_value, y = value)) + 
  geom_point(alpha = .2) + 
  geom_smooth(se = FALSE) + 
  #scale_y_log10() +
  facet_wrap(~ predictor, scales = "free_x")
```

We can see that there does not appear to be a single linear relationship for either of these predictors. Thus we might want to think about using [splines](https://www.math.uh.edu/~jingqiu/math4364/spline.pdf) to model the relationship in our training data more closely (see also [this post about natural cubic splines](https://towardsdatascience.com/numerical-interpolation-natural-cubic-spline-52c1157b98ac), [this tidymodels article](https://tidymodels.github.io/tune/articles/getting_started.html), and [this example](https://www.psych.mcgill.ca/misc/fda/ex-basis-b1.html)). For example, for the latitude plot (left), if we had two lines joined at one break-point, called a knot, around 40, with the first line having a positive slope and the second a negative slope, this would fit the data more similarly to the blue line shown in the figure.
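To make the idea of a knot concrete, here is a minimal base R sketch using simulated data (made up for illustration, not the monitor data). It fits two connected line segments with a single knot at 40 by adding a "hinge" term to an ordinary linear model; the names `hinge`, `slope_left`, and `slope_right` are hypothetical.

```r
# Simulated data (made up for illustration): y rises until x = 40, then falls
set.seed(1234)
x <- runif(200, min = 25, max = 50)
y <- 10 + 0.3 * pmin(x, 40) - 0.5 * pmax(x - 40, 0) + rnorm(200, sd = 0.5)

knot  <- 40
hinge <- pmax(x - knot, 0)  # zero left of the knot, (x - knot) right of it
fit   <- lm(y ~ x + hinge)

# Slope of the first segment, and slope of the second segment
slope_left  <- unname(coef(fit)["x"])
slope_right <- unname(coef(fit)["x"] + coef(fit)["hinge"])
c(left = slope_left, right = slope_right)
```

Natural splines generalize this idea by joining smooth cubic pieces, rather than straight lines, at the knots.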

We can tune for the number of knots by using a step function in the `recipes` package called `step_ns()`, where ns stands for natural splines. In order to tune for the number of knots or degrees of freedom, we can set the `deg_free` argument to `tune()`. This is helpful because we aren't yet sure how closely we should follow the relationship between the value and the longitude and latitude data in our training data to achieve good accuracy while keeping our model generalizable to other data. 

This is when our cross validation methods become really handy. We can test out different values for the `deg_free` argument and see how our model performance varies across our training folds to try to find the optimal value.
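The logic of tuning degrees of freedom with cross validation can be sketched in base R. This is only an illustration under stated assumptions: the data are simulated, `splines::ns()` stands in for `step_ns()`, and the loop is a hand-rolled stand-in for what the `tune` package automates.

```r
library(splines)  # ships with R; provides ns() for natural splines

set.seed(1234)
# Simulated non-linear data (made up for illustration)
n   <- 300
x   <- runif(n, 25, 50)
y   <- sin(x / 4) + rnorm(n, sd = 0.3)
dat <- data.frame(x = x, y = y)

v       <- 4
fold_id <- sample(rep(seq_len(v), length.out = n))
df_vals <- c(1, 3, 5)  # candidate degrees of freedom

# For each candidate value, average the held-out RMSE across the folds
cv_rmse <- sapply(df_vals, function(d) {
  fold_rmse <- sapply(seq_len(v), function(k) {
    fit  <- lm(y ~ ns(x, df = d), data = dat[fold_id != k, ])
    pred <- predict(fit, newdata = dat[fold_id == k, ])
    sqrt(mean((dat$y[fold_id == k] - pred)^2))
  })
  mean(fold_rmse)
})

data.frame(deg_free = df_vals, mean_rmse = cv_rmse)
best_df <- df_vals[which.min(cv_rmse)]
best_df
```

The candidate with the lowest mean held-out RMSE is the one we would carry forward, since it balances following the training data closely against generalizing to data the model has not seen.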

We will update our recipe to add these steps. It is a good idea to do this for individual predictors because you can name each with the `tune()` function so that you can keep track of it later. We can see what we intend to tune with the `parameters()` function of the `dials` package. 

See [here](https://tidymodels.github.io/tune/articles/getting_started.html) for more information about implementing this in tidymodels.

```{r}

# add natural spline steps; naming each tune() placeholder lets us track it later
novel_rec %<>%
  step_ns(lon, deg_free = tune("lon df")) %>%
  step_ns(lat, deg_free = tune("lat df"))

# list the parameters that have been marked for tuning
pm_param <- dials::parameters(novel_rec)
pm_param
```

Generally you could use the `grid_*()` functions of the `dials` package to create the different combinations of degrees of freedom to test for both variables to optimize the model. In our case we can see from the plot above that if we add more than, say, 4 or 5 degrees of freedom we will likely over-fit the data. So instead of using these functions we will create our own grid using the base `seq()` and `expand.grid()` functions.

```{r}
#an example of what you could do:
#spline_grid <-dials::grid_regular(pm_param, levels = 3)
df_vals <- seq(1, 5, by = 2)
spline_grid <- expand.grid(`lon df` = df_vals, `lat df` = df_vals)
spline_grid
```


Now we will tune this hyper-parameter (degrees of freedom) for both the `lat` and `lon` variables using our cross validation folds. To do this we will use the `tune_grid()` function of the `tune` package.

```{r}

# fit the model with each combination in the grid on each fold
df_tuning <- lm_PM_model %>% 
  tune::tune_grid(novel_rec, resamples = vfold_pm, 
                  grid = spline_grid)

df_tuning
```



```{r}

df_tuning %>%
  collect_metrics()

show_best(df_tuning, metric = "rmse", n =1)
```




### Linear Regression Model with PCA

We can create another workflow to see how model performance compares using a different model. In this case we are going to perform something called [Principal Component Analysis or PCA](https://medium.com/@savastamirko/pca-a-linear-transformation-f8aacd4eb007){target="_blank"}. 

So what is PCA?

PCA is a widely used dimensionality reduction method (a form of unsupervised machine learning). It creates new variables that capture the most variation within the data, yet reduce the data down to a smaller number of principal components. It does so by transforming the data using an [orthogonal](https://en.wikipedia.org/wiki/Orthogonality#:~:text=In%20mathematics%2C%20orthogonality%20is%20the,u%2C%20v)%20%3D%200.){target="_blank"} linear transformation. In other words, it creates new variables that are [linear combinations](https://www.mathbootcamps.com/linear-combinations-vectors/){target="_blank"} of the variables within the data. Importantly, these new variables are orthogonal, meaning that the new variables have zero covariance. In simpler terms, we are expressing unique types of variation within the data as new variables.

Check out this [video](https://youtu.be/_UVHneBUBW0){target="_blank"} for more information.
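Here is a minimal sketch of these properties using the base `prcomp()` function on a few numeric columns of the built-in `mtcars` data. It is an illustration only; the case study itself uses `step_pca()` inside a recipe.

```r
# PCA on four numeric columns of the built-in mtcars data (illustrative only)
X   <- mtcars[, c("mpg", "disp", "hp", "wt")]
pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Each principal component is a linear combination of the original variables
pca$rotation

# Proportion of the total variance captured by each component
summary(pca)$importance["Proportion of Variance", ]

# The new variables have zero covariance with one another
# (off-diagonal entries are zero up to rounding error)
round(cov(pca$x), 10)
```

The first components capture most of the variation, which is why keeping only a few of them can reduce the number of predictors while retaining much of the information.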

```{r}
lm_PCA_rec <-recipe(train_pm) %>%
    update_role(everything(), new_role = "predictor")%>%
    update_role(value, new_role = "outcome")%>%
    update_role(id, new_role = "id variable") %>%
    update_role("fips", new_role = "county id") %>%
    step_dummy(state, county, city, zcta) %>%
    step_pca(all_predictors()) 
```

Let's take a look to see what the `step_pca` function does to our predictors. To do so recall that we need to use the `prep` and `juice`  functions of the `recipes` package  on our recipe.

```{r}
prepped_rec <- prep(lm_PCA_rec, verbose = TRUE, retain = TRUE )
juiced_train<- juice(prepped_rec)
glimpse(juiced_train)
```

We still want to use the `lm` engine for our regression so we can use the same model object as before:
```{r}
lm_PM_model
```



```{r}
pca_wflow <-workflows::workflow() %>%
            workflows::add_recipe(lm_PCA_rec) %>%
            workflows::add_model(lm_PM_model)


pca_wflow
```





Remember that using `workflows` we don't actually need to prep our recipe, we can just fit our model directly. 

Fit the cross validation samples:

```{r}
resample_pca_fit <-tune::fit_resamples(pca_wflow, vfold_pm)
```


Look at the performance:
```{r}
collect_metrics(resample_pca_fit)
```

And we can compare this with our previous performance:

```{r}
collect_metrics(resample_fit)
```

So we can see that our performance isn't quite as good - especially if we look at the `rmse` value.  


### Random Forest


Now for one last recipe, we are going to predict using a decision tree method called [random forest](https://en.wikipedia.org/wiki/Random_forest){target="_blank"}.

A [decision tree](https://medium.com/greyatom/decision-trees-a-simple-way-to-visualize-a-decision-dc506a403aeb){target="_blank"} is a tool to partition data or anything really, based on a series of sequential (often binary) decisions, where the decisions are chosen  based on their ability to optimally split the data.

Here you can see a simple example:

```{r, echo=FALSE}
knitr::include_graphics("https://miro.medium.com/max/1000/1*LMoJmXCsQlciGTEyoSN39g.jpeg")

```

#### [[source](https://towardsdatascience.com/understanding-random-forest-58381e0602d2){target="_blank"}]

In the case of [random forest](https://towardsdatascience.com/decision-tree-ensembles-bagging-and-boosting-266a8ba60fd9), multiple decision trees are created - hence the name forest, and each tree is built using a random subset of the training data (with replacement) - hence the full name random forest. This random aspect helps to keep the algorithm from overfitting the data.

The mean of the predictions from each of the trees is used in the final output.

```{r, echo=FALSE}
knitr::include_graphics("https://miro.medium.com/max/1400/0*f_qQPFpdofWGLQqc.png")
```
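The two ideas above, fitting each tree to a random sample drawn with replacement and averaging the predictions, can be sketched with the `rpart` package (a recommended package that ships with R) and the built-in `mtcars` data. This is a simplified bagging illustration, not the `randomForest` algorithm itself: a true random forest also samples a random subset of predictors at each split, which this sketch omits.

```r
library(rpart)  # recommended package shipped with R

set.seed(1234)
n_trees <- 25
n       <- nrow(mtcars)

# Fit each tree to a bootstrap sample (rows drawn with replacement)
trees <- lapply(seq_len(n_trees), function(i) {
  boot_rows <- sample(n, size = n, replace = TRUE)
  rpart(mpg ~ wt + hp + disp, data = mtcars[boot_rows, ],
        control = rpart.control(minsplit = 5))
})

# Each column holds one tree's predictions for every car
tree_preds <- sapply(trees, predict, newdata = mtcars)

# The ensemble prediction is the mean across trees
ensemble_pred <- rowMeans(tree_preds)
head(ensemble_pred)
```

Because each tree sees a slightly different sample, their individual errors partly cancel when averaged.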


In our case, the random forest algorithm that we are working with does not work well when there are categorical variables with more than 53 levels, so we will need to remove the `zcta` variable. The recipe below also removes the `county` variable and the `id` and `fips` identifier variables.

```{r}

RF_rec <-recipe(train_pm) %>%
    update_role(everything(), new_role = "predictor")%>%
    update_role(value, new_role = "outcome")%>%
    update_role(id, new_role = "id variable") %>%
    update_role("fips", new_role = "county id") %>%
    step_string2factor("state", "county", "city") %>%
    step_rm("county") %>%
    step_rm("id") %>%
    step_rm("fips")%>%
    step_rm("zcta") %>%
    step_corr(all_numeric())%>%
    step_nzv(all_numeric())
```

The `rand_forest()` function of the `parsnip` package has three important arguments that act as an interface for the different possible engines to perform a random forest analysis:

1) `mtry` - the number of predictor or explanatory variables that will be randomly sampled as options at each split when creating the tree models. The default number for regression analyses is the number of predictors divided by 3. 

2) `min_n` - the minimum number of data points in a node that are required for the node to be split further.

3) `trees` - the number of trees in the ensemble.

We will set `mtry` to 10 and `min_n` to 3.

```{r}

PMtree_model <- parsnip::rand_forest(mtry = 10, min_n = 3, mode = "regression")
PMtree_model

RF_PM_model <- 
  PMtree_model  %>%
  parsnip::set_engine("randomForest") %>%
  set_mode("regression")

RF_PM_model


RF_wflow <-workflows::workflow() %>%
            workflows::add_recipe(RF_rec) %>%
            workflows::add_model(RF_PM_model)
RF_wflow
```

Fitting the data with just parsnip and with the workflow:
```{r}

##just parsnip
RF_fit <- RF_PM_model %>% 
parsnip::fit(value ~., data =juiced_train_ready)

## with workflow
RF_wflow_fit <- parsnip::fit(RF_wflow, data = train_pm)
RF_wflow_fit
```


Let's take a look at the top 10 contributing variables:

```{r}
RF_wflow_fit%>% 
  pull_workflow_fit() %>% 
  vip(num_features = 10)
```

Interestingly, the CMAQ values were also important in the previous model. In that model, however, the variable indicating whether or not the monitor was located in California was also very predictive. 

Now let's take a look at model performance by fitting the data using cross validation:
```{r}
resample_RF_fit <-tune::fit_resamples(RF_wflow, vfold_pm)
collect_metrics(resample_RF_fit)
```

Now let's compare the performance of this model with the others:

```{r}
# our initial linear regression model:
collect_metrics(resample_fit)
# our initial linear regression model with lat/lon degrees of freedom tuning:
show_best(df_tuning, metric = "rmse", n =1)
# our PCA model:
collect_metrics(resample_pca_fit)
# our random forest model:
collect_metrics(resample_RF_fit)

```
OK, so our first model had a mean rmse value of 2.18.  
The model with the lat/lon degrees of freedom tuning had a mean rmse value of 2.02, thus showing some improvement.  
The PCA model had a mean rmse value of 2.76.  

It looks like the random forest model had the lowest rmse value of 1.79.


If we tuned our random forest model based on the number of trees or the value for `mtry` (which is "The number of predictors that will be randomly sampled at each split when creating the tree models"), we might get a model with even better performance.


However, our cross validated mean rmse value of 1.79 is quite good because the range of our true outcome values is much larger: `r range(test_pm$value)`.
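As a reminder of what the rmse measures, here is a small worked sketch with made-up numbers (not values from the case study). It computes the root mean squared error by hand and compares it with the range of the outcome; the names `true_vals` and `estimates` are hypothetical.

```r
# Made-up true and predicted values, for illustration only
true_vals <- c(8.2, 10.5, 12.1, 6.9, 15.3)
estimates <- c(9.0, 10.1, 11.0, 8.0, 13.9)

# Root mean squared error: the typical size of a prediction error
rmse <- sqrt(mean((true_vals - estimates)^2))
rmse

# Compare the typical error with the spread of the outcome
outcome_range <- diff(range(true_vals))
rmse / outcome_range
```

An rmse that is small relative to the range of the outcome indicates that the typical error is small compared with how much the outcome varies.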

### Final Model Performance Evaluation


Now that we have decided that we have reasonable performance with the training data, we can stop building models here and use the `yardstick` package (and the `tune` package, if we used `workflows` to fit our model) to evaluate performance with our testing data. 

So now we will use our random forest model to predict values for the monitors in the testing data.

Using `parsnip` we would need to use the baked testing data. With the `workflows` package, we could use the raw testing data.

Importantly, with `parsnip` alone, ID variables are not dealt with as nicely as with the `workflows` package, so we would need to remove them. We did this above as well, when we created the processed training data for this model, the `juiced_train` data.

```{r}

# using just parsnip: drop the ID variables from the baked testing data
baked_test_pm_ready <- baked_test_pm %>% select(-id, -fips)
values_pred_parsnip <- predict(RF_fit, baked_test_pm_ready)
values_pred_parsnip
 
# using the workflows version
values_pred_wfs <- 
  predict(RF_wflow_fit, test_pm) %>% 
  bind_cols(test_pm %>% select(value, fips, county, id)) 
values_pred_wfs

#model performance with test set
yardstick::metrics(values_pred_wfs, truth = value, estimate = .pred)

```
Awesome! We can see that our rmse of 1.49 with our testing data is quite similar to what we saw with cross validation. We achieved quite good performance, which suggests that we could predict values for other locations with more sparse monitoring, based on our predictors, with reasonable accuracy.

We could also use the `last_fit()` function of the `tune` package to look at performance if we chose to create a workflow using the `workflows` package.

```{r}
overallfit <-tune::last_fit(RF_wflow, pm_split)
 # or
overallfit <-RF_wflow %>%
  tune::last_fit(pm_split)

overallfit$.metrics[[1]]
```

We could check out test performance using the `collect_metrics()` function of the `tune` package.


```{r}
test_performance <- overallfit  %>%tune::collect_metrics()
test_performance
```

Here you can see the predictions for the test set (the 292 rows with predictions out of the 876 original monitor values) also using the `tune` package with the `collect_predictions()` function.

#### {.scrollable }
```{r}
test_predictions <- overallfit  %>%tune::collect_predictions()
test_predictions %>%
  print(n = 1e3)
```

####

## Data Visualization

Our main question for this case study was:  

1) Can we predict annual average air pollution concentrations at the granularity of zip code regional levels using predictors such as data about population density, urbanization, and road density, as well as satellite pollution data and chemical modeling data?

We have indeed created a model that can predict fine particulate matter air pollution levels based on our predictor variables.

Now let's make a plot of our predicted values and the true values.

First, let's start by making a plot of our monitors:

We will use the following packages to create a map of the US:

1) `sf`  
2) `maps`  
3) `rnaturalearth`  
4) `rgeos`  

According to this [link on wikipedia](https://en.wikipedia.org/wiki/List_of_extreme_points_of_the_United_States#Westernmost), these are the latitude and longitude bounds of the continental US:

top = 49.3457868 (north latitude)  
left = -124.7844079 (west longitude)  
right = -66.9513812 (east longitude)  
bottom = 24.7433195 (south latitude)  

We will start with getting an outline of the US with the `ne_countries()` function of the `rnaturalearth` package.
```{r}
library("rgeos")

world <- rnaturalearth::ne_countries(scale = "medium", returnclass = "sf")

ggplot(data = world) +
    geom_sf() +
    geom_point(data = pm, aes(x = lon, y = lat), size = 2, 
        shape = 23, fill = "darkred") +
    coord_sf(xlim = c(-125, -66), ylim = c(24.5, 50), expand = FALSE)
```

Now let's add county lines.

County graphical data is available from the `maps` package. The `sf` package, whose name is short for simple features, creates a data frame from this graphical data so that we can work with it.

```{r}

counties <- sf::st_as_sf(maps::map("county", plot = FALSE, fill = TRUE))

monitors <-ggplot(data = world) +
    geom_sf(data = counties, fill = NA, color = gray(.5))+
    geom_point(data = pm, aes(x = lon, y = lat), size = 2, 
        shape = 23, fill = "darkred") +
    coord_sf(xlim = c(-125, -66), ylim = c(24.5, 50), expand = FALSE)+
    ggtitle("Monitor Locations")+
    theme(axis.title.x=element_blank(),
        axis.text.x=element_blank(),
        axis.ticks.x=element_blank(),
        axis.title.y=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks.y=element_blank())

monitors
```

Now let's add a fill at the county level for the true monitor values of air pollution:

```{r}

head(counties)

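pm <- readr::read_csv(here::here("docs", "pm25_data.csv"))

# split the "state,county" ID into separate columns
counties %<>% tidyr::separate(ID, into = c("state", "county"), sep = ",")

# capitalize the county names so that they can be joined with the monitor data
counties[["county"]] <- stringr::str_to_title(counties[["county"]])

map_data <- inner_join(counties, pm, by = "county")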

truth<-ggplot(data = world) +
    geom_sf() +
    geom_sf(data = map_data, aes(fill = value)) +
    scale_fill_viridis_c(trans = "sqrt", alpha = .4) +
      #geom_point(data = pm, aes(x = lon, y = lat), size = 1.5, alpha = 2,
       # shape = 23, fill = "darkred") +
        coord_sf(xlim = c(-125, -66), ylim = c(24.5, 50), expand = FALSE)+
      ggtitle("Monitor PM~2.5~ levels")
truth



truth <-ggplot(data = world) +
    geom_sf() +
    geom_sf(data = map_data, aes(fill = value)) +
    #scale_fill_viridis_c(trans = "sqrt", alpha = .4) +
      #geom_point(data = pm, aes(x = lon, y = lat), size = 1.5, alpha =2,
      #  shape = 23, fill = "darkred") +
        coord_sf(xlim = c(-125, -66), ylim = c(24.5, 50), expand = FALSE)+
  scale_fill_gradientn(colours=topo.colors(7),na.value = "transparent",
                           breaks=c(0,10,20),labels=c(0,10,20),
                           limits=c(0,23.5), name = "PM ug/m3")+
  ggtitle("True PM"~2.5~" levels")+
    theme(axis.title.x=element_blank(),
        axis.text.x=element_blank(),
        axis.ticks.x=element_blank(),
        axis.title.y=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks.y=element_blank())

truth

```
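One thing to keep in mind with a join on county name alone is that the same county name can occur in more than one state. The sketch below uses made-up data frames (`county_shapes` and `monitor_vals` are hypothetical, not objects from this case study) to show the difference between joining on county only and joining on both state and county. It assumes both tables carry a state column written in the same format, which may first require harmonizing.

```r
# Made-up example data: "Washington" county exists in more than one state
county_shapes <- data.frame(
  state  = c("alabama", "oregon"),
  county = c("Washington", "Washington")
)
monitor_vals <- data.frame(
  state  = "alabama",
  county = "Washington",
  value  = 10.2
)

# Joining on county alone attaches the Alabama monitor to both counties
by_county <- merge(county_shapes, monitor_vals, by = "county")
by_county

# Joining on state and county keeps only the true match
by_both <- merge(county_shapes, monitor_vals, by = c("state", "county"))
by_both
```

Joining on both keys is one way to keep each monitor value attached to a single county shape.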


Now let's do the same with our predicted values.

Let's grab both the testing and training fitted values so that we have as much data as possible. 
In this case, the output structure for the training data fit is slightly different using `randomForest`. The fitted values are called `predicted` and the `broom` functions like `tidy()` and `augment()` will not work. So we will manually grab the fitted training data values.

```{r}

#test data
values_pred_wfs

#training data
training_RF_estimates <-RF_fit[["fit"]][["predicted"]]
training_RF_estimates <-tibble(".pred" =training_RF_estimates,
value= train_pm$value, fips = train_pm$fips,  county = train_pm$county, id =train_pm$id)

all_pred <-bind_rows(values_pred_wfs, training_RF_estimates )


```


```{r}
map_data <-inner_join(counties, all_pred, by = "county")

pred <-ggplot(data = world) +
    geom_sf() +
    geom_sf(data = map_data, aes(fill = .pred)) +
    #scale_fill_viridis_c(trans = "sqrt", alpha = .4) +
      #geom_point(data = pm, aes(x = lon, y = lat), size = 1.5, alpha =2,
      #  shape = 23, fill = "darkred") +
        coord_sf(xlim = c(-125, -66), ylim = c(24.5, 50), expand = FALSE)+
  scale_fill_gradientn(colours=topo.colors(7),na.value = "transparent",
                           breaks=c(0,10,20),labels=c(0,10,20),
                           limits=c(0,23.5), name = "PM ug/m3")+
  ggtitle("Predicted PM"~2.5~" levels")+
    theme(axis.title.x=element_blank(),
        axis.text.x=element_blank(),
        axis.ticks.x=element_blank(),
        axis.title.y=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks.y=element_blank())

pred

```



```{r}
cowplot::plot_grid(monitors, truth, pred)
```


```{r}
cowplot::plot_grid(truth, pred,nrow = 2 )

```


```{r, echo = FALSE, message=FALSE}

png(here::here("img", "main_plot_maps.png"), height = 1500, width = 2000, res = 300)
cowplot::plot_grid(truth, pred,nrow = 2 )

dev.off()

```

## Summary

Let's review everything:

```{r, echo=FALSE, out.width="800px"}
knitr::include_graphics(here::here("img","ecosystem.png"))
```


We have explored gravimetric monitoring data of fine particulate matter air pollution. We have utilized the tidymodels ecosystem of packages to predict monitor values using a variety of predictors, also known as explanatory variables, including satellite data, road density data, and population density, among others. Our model could now be extended to predict pollution levels in areas with poor monitoring, to help identify regions where populations may be especially at risk for the health effects of air pollution.  

We learned that there are two major types of what is called supervised machine learning: prediction and classification. We learned that prediction is used when the outcome variable is numeric and classification is performed when the outcome variable is categorical.  

We performed the major steps of machine learning that we introduced in the beginning of the data analysis:  

1) Data exploration  

We used packages like `skimr`, `summarytools`, `corrplot`, `ggcorrplot`, and `GGally` to better understand our data. These packages can tell us how many missing values each variable has (if any), the class of each variable, the distribution of values for each variable, the sparsity of each variable, and the level of correlation between variables.  

2) Data splitting 

We used the `rsample` package to first perform an initial split of our data into two pieces: a training set and a testing set. The training set was used to optimize the model, while the testing set was used only to evaluate the performance of our final model. We also used the `rsample` package to create cross validation subsets of our training data. This allowed us to better assess the performance of our tested models using our training data.  

3) Variable assignment and preprocessing   

We used the `recipes` package to assign variable roles (such as outcome, predictor, and id variable). We also used this package to create a recipe for preprocessing our training and testing data. This involved steps such as: `step_dummy` to create dummy numeric encodings of our categorical variables, `step_corr` to remove highly correlated variables, and `step_nzv` to remove near zero variance variables that would contribute little to our model and potentially add noise.  We learned that once our recipe was created and prepped using `prep()` we could extract the pre-processed training data using `juice()` or our pre-processed testing data using `bake()`. We also learned that if we used the newer `workflows` package, we did not need to use the `prep()`, `juice()`, or `bake()` functions, but that it is still useful to know how to do so if we want to look more deeply at our data and how the recipe is influencing it.  

4) Model specification, fitting, tuning and performance evaluation using the training data  

We learned that the model needs to first be fit to the training data. We learned that in both classification and prediction, the model is fit to the training data and the explanatory variables are used to estimate numeric values (in the case of prediction) or categorical values (in the case of classification) of the outcome variable of interest. We learned that we specify the model and its specifications using the `parsnip` package and that we also use this package to fit the model using the `fit()` function. We learned that if we just use `parsnip` to fit the model, then we need to use the pre-processed training data (output from `juice()`). We learned that we can use the raw training data if we use the `workflows` package to create a workflow that pre-processes our data for us.   

We learned that if the model fits well, then the estimated values will be very similar to the true outcome variable values in our training data. We learned that we can assess model performance using the `yardstick` package and the `metrics()` function. We also learned that we can use subsets of our training data (which we created with the `rsample` package) to perform cross validation to get a better estimate of the performance of our model using our training data, as we want our results to be generalizable and to perform well with other data, not just our training data. We used the `fit_resamples()` function of the `tune` package to fit our model on our different training data subsets and the `collect_metrics()` function (also of the `tune` package) to evaluate model performance using these subsets.  We also learned that we can potentially improve model performance by tuning aspects of the model called [hyper-parameters](https://www.datacamp.com/community/tutorials/parameter-optimization-machine-learning-models){target="_blank"} to determine the best option for model performance. We learned that we can do this using the `tune` and `dials` packages and evaluating the performance of our model with the different hyper-parameter options and our training data subsets that we used for cross validation. After we tested several different methods to model our data, we compared them to choose the best performing model as our final model.  

5) Overall model performance evaluation  

Once we chose our final model, we evaluated the final model performance using the testing data. This gives us a better estimate about how well the model will predict or classify the outcome variable of interest with new independent data. Ideally one would also perform an evaluation with independent data to provide a sense of how generalizable the model is to other data sources. 

We first used our fitted model to predict values for our testing data, using either just `parsnip` and the pre-processed testing data (created with the `bake()` function of `recipes`), or our raw testing data if we used a workflow. We used the same performance evaluation functions (`yardstick::metrics()` and `tune::collect_metrics()` (when using cross validation)). We also learned how we can use the `last_fit()` function of the `tune` package, if we used a workflow, to get the test data performance using the initial data and the testing/training split information.

Analyses like the one in our case study are important for defining which groups could benefit the most from interventions, education, and policy changes when attempting to mitigate public health challenges. You can see in this [article](https://www.nejm.org/doi/full/10.1056/NEJMoa1702747){target="_blank"} that many additional considerations would be involved to adequately understand the data enough to recommend policy changes.

### Suggested Homework

Students can predict air pollution monitor values using a different algorithm and provide an explanation for how that algorithm works and why it may be a good choice for modeling this data.

### Helpful Links

1) A review of [tidymodels](https://rviews.rstudio.com/2019/06/19/a-gentle-intro-to-tidymodels/){target="_blank"}  
2) A [course on tidymodels](https://juliasilge.com/blog/tidymodels-ml-course/){target="_blank"} by Julia Silge  
3) [More examples, explanations, and info about tidymodels development](https://www.tidymodels.org/learn/){target="_blank"} from the developers  
4) A guide for [preprocessing with recipes](http://www.rebeccabarter.com/blog/2019-06-06_pre_processing/){target="_blank"}  
5) A [guide](https://briatte.github.io/ggcorr/){target="_blank"} for using GGally to create correlation plots  
6) A [guide](https://www.tidyverse.org/blog/2018/11/parsnip-0-0-1/){target="_blank"} for using parsnip to try different algorithms or engines  
7) A [list of recipe functions](https://tidymodels.github.io/recipes/reference/index.html){target="_blank"}  
8) A great blog post about [cross validation](https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6){target="_blank"}  
9) A discussion about [evaluating model performance](https://medium.com/@limavallantin/metrics-to-measure-machine-learning-model-performance-e8c963665476){target="_blank"} for a deeper explanation about how to evaluate model performance  
10) [RStudio cheatsheets](https://rstudio.com/resources/cheatsheets/){target="_blank"}  
11) An [explanation](https://towardsdatascience.com/supervised-vs-unsupervised-learning-14f68e32ea8d){target="_blank"} of supervised vs unsupervised machine learning and bias-variance trade-off.
12) A thorough [explanation](https://royalsocietypublishing.org/doi/10.1098/rsta.2015.0202#:~:text=Principal%20component%20analysis%20(PCA)%20is,variables%20that%20successively%20maximize%20variance.){target="_blank"} of principal component analysis.
13) If you have access, this is a great [discussion](https://www.tandfonline.com/doi/abs/10.1080/00031305.1984.10483183){target="_blank"}  about the difference between independence, orthogonality, and lack of correlation.
14) Great [video explanation](https://youtu.be/_UVHneBUBW0){target="_blank"} of PCA.  

<u>Terms and concepts covered:</u>  
[Tidyverse](https://www.tidyverse.org/){target="_blank"}  
[parameters and hyper-parameters](https://www.datacamp.com/community/tutorials/parameter-optimization-machine-learning-models){target="_blank"}  

<u>Packages used in this case study: </u>

 Package   | Use                                                                         
---------- |-------------
[here](https://github.com/jennybc/here_here){target="_blank"}       | to easily load and save data
[readr](https://readr.tidyverse.org/){target="_blank"}      | to import the CSV file data
[dplyr](https://dplyr.tidyverse.org/){target="_blank"}      | to view/arrange/filter/select/compare specific subsets of the data 
[skimr](https://cran.r-project.org/web/packages/skimr/index.html){target="_blank"}      | to get an overview of data
[summarytools](https://cran.r-project.org/web/packages/summarytools/index.html){target="_blank"}      | to get an overview of data in a different style
[magrittr](https://magrittr.tidyverse.org/articles/magrittr.html){target="_blank"}   | to use the `%<>%` piping operator 
[corrplot](https://cran.r-project.org/web/packages/corrplot/vignettes/corrplot-intro.html){target="_blank"} | to make large correlation plots
[ggcorrplot](http://www.sthda.com/english/wiki/ggcorrplot-visualization-of-a-correlation-matrix-using-ggplot2){target="_blank"}| also to make large correlation plots
[GGally](https://cran.r-project.org/web/packages/GGally/GGally.pdf){target="_blank"} | to make smaller correlation plots  
[rsample](https://tidymodels.github.io/rsample/articles/Basics.html){target="_blank"}   | to split the data into testing and training sets and to split the training set for cross-validation  
[recipes](https://tidymodels.github.io/recipes/){target="_blank"}   | to pre-process data for modeling in a tidy and reproducible way and to extract the pre-processed data. Major functions are `recipe()`, `prep()`, and the various transformation `step_*()` functions, as well as `juice()`, which extracts the final pre-processed training data, and `bake()`, which applies the recipe steps to new data such as the testing data. See [here](https://cran.r-project.org/web/packages/recipes/recipes.pdf){target="_blank"} for more info.
[parsnip](https://tidymodels.github.io/parsnip/){target="_blank"}   | an interface to create models (major functions are  `fit()`, `set_engine()`)
[yardstick](https://tidymodels.github.io/yardstick/){target="_blank"}   | to evaluate the performance of models
[broom](https://www.tidyverse.org/blog/2018/07/broom-0-5-0/){target="_blank"} | to get tidy output for our model fit and performance
[ggplot2](https://ggplot2.tidyverse.org/){target="_blank"}    | to make visualizations with multiple layers
[dials](https://www.tidyverse.org/blog/2019/10/dials-0-0-3/){target="_blank"} | to specify hyper-parameter tuning
[tune](https://tune.tidymodels.org/){target="_blank"} | to perform cross validation, tune hyper-parameters, and get performance metrics
[workflows](https://www.rdocumentation.org/packages/workflows/versions/0.1.1){target="_blank"} | to create a modeling workflow that streamlines the modeling process
[vip](https://cran.r-project.org/web/packages/vip/vip.pdf){target="_blank"} | to create variable importance plots
[randomForest](https://cran.r-project.org/web/packages/randomForest/randomForest.pdf){target="_blank"} | to perform the random forest analysis
[stringr](https://stringr.tidyverse.org/articles/stringr.html){target="_blank"}    | to manipulate the text in the map data
[tidyr](https://tidyr.tidyverse.org/){target="_blank"}      | to separate data within a column into multiple columns
[rnaturalearth](https://cran.r-project.org/web/packages/rnaturalearth/README.html){target="_blank"} | to get the geometry data for the earth to plot the US
[maps](https://cran.r-project.org/web/packages/maps/maps.pdf){target="_blank"} | to get map database data about counties to draw them on our US map
[sf](https://r-spatial.github.io/sf/){target="_blank"} | to convert the map data into a data frame
[lwgeom](https://cran.r-project.org/web/packages/lwgeom/lwgeom.pdf){target="_blank"} | to support the `sf` functions that convert the geographical map data
[rgeos](https://cran.r-project.org/web/packages/rgeos/rgeos.pdf){target="_blank"} | to use geometry data
[cowplot](https://cran.r-project.org/web/packages/cowplot/vignettes/introduction.html){target="_blank"} | to allow plots to be combined
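
As a quick reminder of how the `rsample` and `recipes` functions listed in the table fit together, here is a minimal sketch. It is not the analysis from this case study: the built-in `mtcars` data, the `mpg` outcome, the 75% split proportion, and the single `step_normalize()` step are all assumptions chosen only for illustration.

```r
library(rsample)
library(recipes)

set.seed(1234)

# Assumption: the built-in mtcars data stands in for the case study data
split <- initial_split(mtcars, prop = 0.75)
train <- training(split)
test  <- testing(split)

# Define the pre-processing: predict mpg from all other variables
# and center/scale every predictor (illustrative step only)
rec <- recipe(mpg ~ ., data = train)
rec <- step_normalize(rec, all_predictors())

# prep() estimates the required statistics (means, sds) from the training data
prepped <- prep(rec, training = train)

# juice() extracts the pre-processed training data
train_processed <- juice(prepped)

# bake() applies the same trained steps to new data, here the testing set
test_processed <- bake(prepped, new_data = test)
```

Because the means and standard deviations are estimated only from the training set in `prep()`, the testing set is transformed with the training statistics rather than its own. In more recent versions of `recipes`, `bake(prepped, new_data = NULL)` is the suggested replacement for `juice()`.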

## Session info
***

```{r}
sessionInfo()
```


